{"title": "Inverse Reinforcement Learning with Locally Consistent Reward Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 1747, "page_last": 1755, "abstract": "Existing inverse reinforcement learning (IRL) algorithms have assumed each expert\u2019s demonstrated trajectory to be produced by only a single reward function. This paper presents a novel generalization of the IRL problem that allows each trajectory to be generated by multiple locally consistent reward functions, hence catering to more realistic and complex experts\u2019 behaviors. Solving our generalized IRL problem thus involves not only learning these reward functions but also the stochastic transitions between them at any state (including unvisited states). By representing our IRL problem with a probabilistic graphical model, an expectation-maximization (EM) algorithm can be devised to iteratively learn the different reward functions and the stochastic transitions between them in order to jointly improve the likelihood of the expert\u2019s demonstrated trajectories. As a result, the most likely partition of a trajectory into segments that are generated from different locally consistent reward functions selected by EM can be derived. Empirical evaluation on synthetic and real-world datasets shows that our IRL algorithm outperforms the state-of-the-art EM clustering with maximum likelihood IRL, which is, interestingly, a reduced variant of our approach.", "full_text": "Inverse Reinforcement Learning with Locally\n\nConsistent Reward Functions\n\nDept. of Computer Science, National University of Singapore, Republic of Singapore\u2020\n\nDept. 
of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, USA\u00a7\n\nQuoc Phong Nguyen\u2020, Kian Hsiang Low\u2020, and Patrick Jaillet\u00a7\n\n{qphong,lowkh}@comp.nus.edu.sg\u2020, jaillet@mit.edu\u00a7\n\nAbstract\n\nExisting inverse reinforcement learning (IRL) algorithms have assumed each expert\u2019s demonstrated trajectory to be produced by only a single reward function. This paper presents a novel generalization of the IRL problem that allows each trajectory to be generated by multiple locally consistent reward functions, hence catering to more realistic and complex experts\u2019 behaviors. Solving our generalized IRL problem thus involves not only learning these reward functions but also the stochastic transitions between them at any state (including unvisited states). By representing our IRL problem with a probabilistic graphical model, an expectation-maximization (EM) algorithm can be devised to iteratively learn the different reward functions and the stochastic transitions between them in order to jointly improve the likelihood of the expert\u2019s demonstrated trajectories. As a result, the most likely partition of a trajectory into segments that are generated from different locally consistent reward functions selected by EM can be derived. Empirical evaluation on synthetic and real-world datasets shows that our IRL algorithm outperforms the state-of-the-art EM clustering with maximum likelihood IRL, which is, interestingly, a reduced variant of our approach.\n\n1 Introduction\nThe reinforcement learning problem in Markov decision processes (MDPs) involves an agent using its observed rewards to learn an optimal policy that maximizes its expected total reward for a given task. However, such observed rewards or the reward function defining them are often not available nor known in many real-world tasks. 
The agent can therefore learn its reward function from an expert associated with the given task by observing the expert\u2019s behavior or demonstration, and this approach constitutes the inverse reinforcement learning (IRL) problem.\nUnfortunately, the IRL problem is ill-posed because infinitely many reward functions are consistent with the expert\u2019s observed behavior. To resolve this issue, existing IRL algorithms have proposed alternative choices of the agent\u2019s reward function that minimize different dissimilarity measures defined using various forms of abstractions of the agent\u2019s generated optimal behavior vs. the expert\u2019s observed behavior, as briefly discussed below (see [17] for a detailed review): (a) The projection algorithm [1] selects a reward function that minimizes the squared Euclidean distance between the feature expectations obtained by following the agent\u2019s generated optimal policy and the empirical feature expectations observed from the expert\u2019s demonstrated state-action trajectories; (b) the multiplicative weights algorithm for apprentice learning [24] adopts a robust minimax approach to deriving the agent\u2019s behavior, which is guaranteed to perform no worse than the expert and is equivalent to choosing a reward function that minimizes the difference between the expected average reward under the agent\u2019s generated optimal policy and the expert\u2019s empirical average reward approximated using the agent\u2019s reward weights; (c) the linear programming apprentice learning algorithm [23] picks its reward function by minimizing the same dissimilarity measure but incurs much less time empirically; (d) the policy matching algorithm [16] aims to match the agent\u2019s generated optimal behavior to the expert\u2019s observed behavior by choosing a reward function that minimizes the sum of squared Euclidean distances between the agent\u2019s generated optimal policy and 
the expert\u2019s estimated\npolicy (i.e., from its demonstrated trajectories) over every possible state weighted by its empirical\nstate visitation frequency; (e) the maximum entropy IRL [27] and maximum likelihood IRL (MLIRL)\n[2] algorithms select reward functions that minimize an empirical approximation of the Kullback-\nLeibler divergence between the distributions of the agent\u2019s and expert\u2019s generated state-action tra-\njectories, which is equivalent to maximizing the average log-likelihood of the expert\u2019s demonstrated\ntrajectories. The log-likelihood formulations of the maximum entropy IRL and MLIRL algorithms\ndiffer in the use of smoothing at the trajectory and action levels, respectively. As a result, the for-\nmer\u2019s log-likelihood or dissimilarity measure does not utilize the agent\u2019s generated optimal policy,\nwhich is consequently questioned by [17] as to whether it is considered an IRL algorithm. Bayesian\nIRL [21] extends IRL to the Bayesian setting by maintaining a distribution over all possible reward\nfunctions and updating it using Bayes rule given the expert\u2019s demonstrated trajectories. The work\nof [5] extends the projection algorithm [1] to handle partially observable environments given the\nexpert\u2019s policy (i.e., represented as a \ufb01nite state controller) or observation-action trajectories.\nAll the IRL algorithms described above have assumed that the expert\u2019s demonstrated trajectories\nare only generated by a single reward function. To relax this restrictive assumption, the recent\nworks of [2, 6] have, respectively, generalized MLIRL (combining it with expectation-maximization\n(EM) clustering) and Bayesian IRL (integrating it with a Dirichlet process mixture model) to handle\ntrajectories generated by multiple reward functions (e.g., due to many intentions) in observable\nenvironments. 
But, each trajectory is assumed to be produced by a single reward function.\nIn this paper, we propose a new generalization of the IRL problem in observable environments,\nwhich is inspired by an open question posed in the seminal works of IRL [19, 22]: If behavior\nis strongly inconsistent with optimality, can we identify \u201clocally consistent\u201d reward functions for\nspeci\ufb01c regions in state space? Such a question implies that no single reward function is globally\nconsistent with the expert\u2019s behavior, hence invalidating the use of all the above-mentioned IRL\nalgorithms. More importantly, multiple reward functions may be locally consistent with the expert\u2019s\nbehavior in different segments along its state-action trajectory and the expert has to switch/transition\nbetween these locally consistent reward functions during its demonstration. This can be observed\nin the following real-world example [26] where every possible intention of the expert is uniquely\nrepresented by a different reward function: A driver intends to take the highway to a food center for\nlunch. An electronic toll coming into effect on the highway may change his intention to switch to\nanother route. Learning of the driver\u2019s intentions to use different routes and his transitions between\nthem allows the transport authority to analyze, understand, and predict the traf\ufb01c route patterns and\nbehavior for regulating the toll collection. 
This example, among others (e.g., commuters\u2019 intentions\nto use different transport modes, tourists\u2019 intentions to visit different attractions, Section 4), motivate\nthe practical need to formalize and solve our proposed generalized IRL problem.\nThis paper presents a novel generalization of the IRL problem that, in particular, allows each ex-\npert\u2019s state-action trajectory to be generated by multiple locally consistent reward functions, hence\ncatering to more realistic and complex experts\u2019 behaviors than that afforded by existing variants of\nthe IRL problem (which all assume that each trajectory is produced by a single reward function)\ndiscussed earlier. At \ufb01rst glance, one may straightaway perceive our generalization as an IRL prob-\nlem in a partially observable environment by representing the choice of locally consistent reward\nfunction in a segment as a latent state component. However, the observation model cannot be easily\nspeci\ufb01ed nor learned from the expert\u2019s state-action trajectories, which invalidates the use of IRL\nfor POMDP [5]. Instead, we develop a probabilistic graphical model for representing our gener-\nalized IRL problem (Section 2), from which an EM algorithm can be devised to iteratively select\nthe locally consistent reward functions as well as learn the stochastic transitions between them in\norder to jointly improve the likelihood of the expert\u2019s demonstrated trajectories (Section 3). As a\nresult, the most likely partition of an expert\u2019s demonstrated trajectory into segments that are gener-\nated from different locally consistent reward functions selected by EM can be derived (Section 3),\nthus enabling practitioners to identify states in which the expert transitions between locally consis-\ntent reward functions and investigate the resulting causes. 
To extend such a partitioning to work for trajectories traversing through any (possibly unvisited) region of the state space, we propose using a generalized linear model to represent and predict the stochastic transitions between reward functions at any state (i.e., including states not visited in the expert\u2019s demonstrated trajectories) by exploiting features that influence these transitions (Section 2). Finally, our proposed IRL algorithm is empirically evaluated using both synthetic and real-world datasets (Section 4).\n\n2 Problem Formulation\nA Markov decision process (MDP) for an agent is defined as a tuple $(\\mathcal{S}, \\mathcal{A}, t, r_\\theta, \\gamma)$ consisting of a finite set $\\mathcal{S}$ of its possible states such that each state $s \\in \\mathcal{S}$ is associated with a column vector $\\phi_s$ of realized feature measurements, a finite set $\\mathcal{A}$ of its possible actions, a state transition function $t : \\mathcal{S} \\times \\mathcal{A} \\times \\mathcal{S} \\to [0, 1]$ denoting the probability $t(s, a, s') \\triangleq P(s'|s, a)$ of moving to state $s'$ by performing action $a$ in state $s$, a reward function $r_\\theta : \\mathcal{S} \\to \\mathbb{R}$ mapping each state $s \\in \\mathcal{S}$ to its reward $r_\\theta(s) \\triangleq \\theta^\\top \\phi_s$ where $\\theta$ is a column vector of reward weights, and constant factor $\\gamma \\in (0, 1)$ discounting its future rewards. When $\\theta$ is known, the agent can compute its policy $\\pi_\\theta : \\mathcal{S} \\times \\mathcal{A} \\to [0, 1]$ specifying the probability $\\pi_\\theta(s, a) \\triangleq P(a|s, r_\\theta)$ of performing action $a$ in state $s$. However, $\\theta$ is not known in IRL and to be learned from an expert (Section 3).\nLet $\\mathcal{R}$ denote a finite set of locally consistent reward functions of the agent and $r_{\\tilde\\theta}$ be a reward function chosen arbitrarily from $\\mathcal{R}$ prior to learning. Define a transition function $\\tau_\\omega : \\mathcal{R} \\times \\mathcal{S} \\times \\mathcal{R} \\to [0, 1]$ for switching between these reward functions as the probability $\\tau_\\omega(r_\\theta, s, r_{\\theta'}) \\triangleq P(r_{\\theta'}|s, r_\\theta, \\omega)$ of switching from reward function $r_\\theta$ to reward function $r_{\\theta'}$ in state $s$ where the set $\\omega \\triangleq \\{\\omega_{r_\\theta r_{\\theta'}}\\}_{r_\\theta \\in \\mathcal{R}, r_{\\theta'} \\in \\mathcal{R} \\setminus \\{r_{\\tilde\\theta}\\}}$ contains column vectors of transition weights $\\omega_{r_\\theta r_{\\theta'}}$ for all $r_\\theta \\in \\mathcal{R}$ and $r_{\\theta'} \\in \\mathcal{R} \\setminus \\{r_{\\tilde\\theta}\\}$ if the features influencing the stochastic transitions between reward functions can be additionally observed by the agent during the expert\u2019s demonstration, and $\\omega \\triangleq \\emptyset$ otherwise. In our generalized IRL problem, $\\tau_\\omega$ is not known and to be learned from the expert (Section 3). Specifically, in the former case, we propose using a generalized linear model to represent $\\tau_\\omega$:\n\n$$\\tau_\\omega(r_\\theta, s, r_{\\theta'}) \\triangleq \\begin{cases} \\dfrac{\\exp(\\omega_{r_\\theta r_{\\theta'}}^\\top \\varphi_s)}{1 + \\sum_{r_{\\bar\\theta} \\in \\mathcal{R} \\setminus \\{r_{\\tilde\\theta}\\}} \\exp(\\omega_{r_\\theta r_{\\bar\\theta}}^\\top \\varphi_s)} & \\text{if } r_{\\theta'} \\neq r_{\\tilde\\theta}, \\\\ \\dfrac{1}{1 + \\sum_{r_{\\bar\\theta} \\in \\mathcal{R} \\setminus \\{r_{\\tilde\\theta}\\}} \\exp(\\omega_{r_\\theta r_{\\bar\\theta}}^\\top \\varphi_s)} & \\text{otherwise;} \\end{cases} \\tag{1}$$\n\nwhere $\\varphi_s$ is a column vector of random feature measurements influencing the stochastic transitions between reward functions (i.e., $\\tau_\\omega$) in state $s$.\nRemark 1. Different from $\\phi_s$ whose feature measurements are typically assumed in IRL algorithms to be realized/known to the agent for all $s \\in \\mathcal{S}$ and remain static over time, the feature measurements of $\\varphi_s$ are, in practice, often not known to the agent a priori and can only be observed when the expert (agent) visits the corresponding state $s \\in \\mathcal{S}$ during its demonstration (execution), and may vary over time according to some unknown distribution, as motivated by the real-world examples given in Section 1. Without prior observation of the feature measurements of $\\varphi_s$ for all $s \\in \\mathcal{S}$ (or knowledge of their distributions) necessary for computing $\\tau_\\omega$ 
(1), the agent cannot consider exploiting $\\tau_\\omega$ for switching between reward functions within MDP or POMDP planning, even after learning its weights $\\omega$; this eliminates the possibility of reducing our generalized IRL problem to an equivalent conventional IRL problem (Section 1) with only a single reward function (i.e., comprising a mixture of locally consistent reward functions). Furthermore, the observation model cannot be easily specified nor learned from the expert\u2019s trajectories of states, actions, and $\\varphi_s$, which invalidates the use of IRL for POMDP [5]. Instead of exploiting $\\tau_\\omega$ within planning, during the agent\u2019s execution, when it visits some state $s$ and observes the feature measurements of $\\varphi_s$, it can then use and compute $\\tau_\\omega$ for state $s$ to switch between reward functions, each of which has generated a separate MDP policy prior to execution, as illustrated in a simple example in Fig. 1 below.\nRemark 2. Using a generalized linear model to represent $\\tau_\\omega$ (1) allows learning of the stochastic transitions between reward functions (specifically, by learning $\\omega$ (Section 3)) to be generalized across different states. After learning, (1) can then be exploited for predicting the stochastic transitions between reward functions at any state (i.e., including states not visited in the expert\u2019s demonstrated state-action trajectories). Consequently, the agent can choose to traverse a trajectory through any region (i.e., possibly not visited by the expert) of the state space during its execution and the most likely partition of its trajectory into segments that are generated from different locally consistent reward functions selected by EM can still be derived (Section 3). In contrast, if the feature measurements of $\\varphi_s$ cannot be observed by the agent during the expert\u2019s demonstration (i.e., $\\omega = \\emptyset$, as defined above), then such a generalization is not possible; only the transition probabilities of switching between reward functions at states visited in the expert\u2019s demonstrated trajectories can be estimated (Section 3). In practice, since the number $|\\mathcal{S}|$ of visited states is expected to be much larger than the length $L$ of any feature vector $\\varphi_s$, the number $O(|\\mathcal{S}||\\mathcal{R}|^2)$ of transition probabilities to be estimated is bigger than $|\\omega| = O(L|\\mathcal{R}|^2)$ in (1). So, observing $\\varphi_s$ offers a further advantage of reducing the number of parameters to be learned.\n\nFigure 1: Transition function $\\tau_\\omega$ of an agent in state $s$ for switching between two reward functions $r_\\theta$ and $r_{\\theta'}$ with their respective policies $\\pi_\\theta$ and $\\pi_{\\theta'}$ generated prior to execution. The diagram has two nodes, $r_\\theta$ and $r_{\\theta'}$, with edges labeled $\\tau_\\omega(r_\\theta, s, r_\\theta)$, $\\tau_\\omega(r_\\theta, s, r_{\\theta'})$, $\\tau_\\omega(r_{\\theta'}, s, r_\\theta)$, and $\\tau_\\omega(r_{\\theta'}, s, r_{\\theta'})$.\n\nFig. 2 shows the probabilistic graphical model for representing our generalized IRL problem. To describe our model, some notations are necessary: Let $N$ be the number of the expert\u2019s demonstrated trajectories and $T_n$ be the length (i.e., number of time steps) of its $n$-th trajectory for $n = 1, \\ldots, N$. Let $r_{\\theta^n_t} \\in \\mathcal{R}$, $a^n_t \\in \\mathcal{A}$, and $s^n_t \\in \\mathcal{S}$ denote its reward function, action, and state at time step $t$ in its $n$-th trajectory, respectively. Let $R_{\\theta^n_t}$, $A^n_t$, and $S^n_t$ be random variables corresponding to their respective realizations $r_{\\theta^n_t}$, $a^n_t$, and $s^n_t$ where $R_{\\theta^n_t}$ is a latent variable, and $A^n_t$ and $S^n_t$ are observable variables. Define $r_{\\theta^n} \\triangleq (r_{\\theta^n_t})_{t=0}^{T_n}$, $a^n \\triangleq (a^n_t)_{t=1}^{T_n}$, and $s^n \\triangleq (s^n_t)_{t=1}^{T_n}$ as sequences of all its reward functions, actions, and states in its $n$-th trajectory, respectively. Finally, define $r_{\\theta^{1:N}} \\triangleq (r_{\\theta^n})_{n=1}^{N}$, $a^{1:N} \\triangleq (a^n)_{n=1}^{N}$, and $s^{1:N} \\triangleq (s^n)_{n=1}^{N}$ as tuples of all its reward function sequences, action sequences, and state sequences in its $N$ trajectories, respectively.\n\nFigure 2: Probabilistic graphical model of the expert\u2019s $n$-th demonstrated trajectory encoding its stochastic transitions between reward functions with solid edges (i.e., $\\tau_\\omega(r_{\\theta^n_{t-1}}, s^n_t, r_{\\theta^n_t}) = P(r_{\\theta^n_t}|s^n_t, r_{\\theta^n_{t-1}}, \\omega)$ for $t = 1, \\ldots, T_n$), state transitions with dashed edges (i.e., $t(s^n_t, a^n_t, s^n_{t+1}) = P(s^n_{t+1}|s^n_t, a^n_t)$ for $t = 1, \\ldots, T_n - 1$), and policy with dotted edges (i.e., $\\pi_{\\theta^n_t}(s^n_t, a^n_t) = P(a^n_t|s^n_t, r_{\\theta^n_t})$ for $t = 1, \\ldots, T_n$).\n\nIt can be observed from Fig. 2 that our probabilistic graphical model of the expert\u2019s $n$-th demonstrated trajectory encodes its stochastic transitions between reward functions, state transitions, and policy. Through our model, the Viterbi algorithm [20] can be applied to derive the most likely partition of the expert\u2019s trajectory into segments that are generated from different locally consistent reward functions selected by EM, as shown in Section 3. Given the state transition function $t(\\cdot, \\cdot, \\cdot)$ and the number $|\\mathcal{R}|$ of reward functions, our model allows tractable learning of the unknown parameters using EM (Section 3), which include the reward weights vector $\\theta$ for all reward functions $r_\\theta \\in \\mathcal{R}$, transition function $\\tau_\\omega$ for switching between reward functions, initial state probabilities $\\nu(s) \\triangleq P(S^n_1 = s)$ for all $s \\in \\mathcal{S}$, and initial reward function probabilities $\\lambda(r_\\theta) \\triangleq P(R_{\\theta^n_0} = r_\\theta)$ for all $r_\\theta \\in \\mathcal{R}$.\n\n3 EM Algorithm for Parameter Learning\nA straightforward approach to learning the unknown parameters $\\Lambda \\triangleq (\\nu, \\lambda, \\{\\theta | r_\\theta \\in \\mathcal{R}\\}, \\tau_\\omega)$ is to select the value of $\\Lambda$ that directly maximizes the log-likelihood of the expert\u2019s demonstrated trajectories. Computationally, such an approach is prohibitively expensive due to a large joint parameter space to be searched for the optimal value of $\\Lambda$. To ease this computational burden, our key idea is to devise an EM algorithm that iteratively refines the estimate for $\\Lambda$ to improve the expected log-likelihood instead, which is guaranteed to improve the original log-likelihood by at least as much:\n\nExpectation (E) step. $Q(\\Lambda, \\Lambda^i) \\triangleq \\sum_{r_{\\theta^{1:N}}} P(r_{\\theta^{1:N}} | s^{1:N}, a^{1:N}, \\Lambda^i) \\log P(r_{\\theta^{1:N}}, s^{1:N}, a^{1:N} | \\Lambda)$.\n\nMaximization (M) step. $\\Lambda^{i+1} = \\operatorname{argmax}_{\\Lambda} Q(\\Lambda, \\Lambda^i)$\n\nwhere $\\Lambda^i$ denotes an estimate for $\\Lambda$ at iteration $i$. The Q function of EM can be reduced to the following sum of five terms, as shown in Appendix A:\n\n\\begin{align}\nQ(\\Lambda, \\Lambda^i) = {} & \\sum_{n=1}^{N} \\log \\nu(s^n_1) + \\sum_{n=1}^{N} \\sum_{r_\\theta \\in \\mathcal{R}} P(R_{\\theta^n_0} = r_\\theta | s^n, a^n, \\Lambda^i) \\log \\lambda(r_\\theta) \\tag{2} \\\\\n& + \\sum_{n=1}^{N} \\sum_{t=1}^{T_n} \\sum_{r_\\theta \\in \\mathcal{R}} P(R_{\\theta^n_t} = r_\\theta | s^n, a^n, \\Lambda^i) \\log \\pi_\\theta(s^n_t, a^n_t) \\tag{3} \\\\\n& + \\sum_{n=1}^{N} \\sum_{t=1}^{T_n} \\sum_{r_\\theta, r_{\\theta'} \\in \\mathcal{R}} P(R_{\\theta^n_{t-1}} = r_\\theta, R_{\\theta^n_t} = r_{\\theta'} | s^n, a^n, \\Lambda^i) \\times \\log \\tau_\\omega(r_\\theta, s^n_t, r_{\\theta'}) \\tag{4} \\\\\n& + \\sum_{n=1}^{N} \\sum_{t=1}^{T_n - 1} \\log t(s^n_t, a^n_t, s^n_{t+1}) . \\tag{5}\n\\end{align}\n\nInterestingly, each of the first four terms in (2), (3), and (4) contains a unique unknown parameter type (respectively, $\\nu$, $\\lambda$, $\\{\\theta | r_\\theta \\in \\mathcal{R}\\}$, and $\\tau_\\omega$) and can therefore be maximized separately in the M step to be discussed below. As a result, the parameter space to be searched can be greatly reduced. 
Note that the third term (3) generalizes the log-likelihood in MLIRL [2] (i.e., assuming all\ntrajectories to be produced by a single reward function) to that allowing each expert\u2019s trajectory to\nbe generated by multiple locally consistent reward functions. The last term (5), which contains the\nknown state transition function t, is independent of unknown parameters \u21e4.1\n\nt = r\u27130|sn, an, \u21e4i) \u21e5 log \u2327!(r\u2713, sn\n\nt = r\u2713|sn, an, \u21e4i) log \u21e1\u2713(sn\n\n0 = r\u2713|sn, an, \u21e4i) log (r\u2713)\n\nt1 = r\u2713, sn\nt+1) .\n\nlog t(sn\n\nt , an\nt )\n\nt , R\u2713n\n\nt , r\u27130)\n\nt , an\n\nt , sn\n\nt=1\n\n1If the state transition function is unknown, then it can be learned by optimizing the last term (5).\n\n4\n\n\fg1(\u2713) ,\n\nt , an\nt )\n\nn=1 I n\n\nP (R\u2713n\n\nt , an\n\nNXn=1\n\nn=1 P (R\u2713n\n\n1 = s, and\n\n0 = r\u2713|sn, an, \u21e4i)\n\n1 for all s 2S where I n\n\nt = r\u2713|sn, an, \u21e4i)\n\u21e1\u2713(sn\n\n1 is an indicator variable of value 1 if sn\n\nLearning initial state probabilities. To maximize the \ufb01rst term in the Q function (2) of EM, we\n\ntime, it does not have to be re\ufb01ned.\nLearning initial reward function probabilities. To maximize the second term in Q function (2) of\n\nuse the method of Lagrange multipliers with the constraintPs2S \u232b(s) = 1 to obtain the estimate\nb\u232b(s) = (1/N )PN\n0 otherwise. Sinceb\u232b can be computed directly from the expert\u2019s demonstrated trajectories in O(N )\nEM, we utilize the method of Lagrange multipliers with the constraintPr\u27132R (r\u2713) = 1 to derive\n(6)\nfor all r\u2713i 2R where i+1 denotes an estimate for  at iteration i+1, \u2713i denotes an estimate for \u2713 at\niteration i, and P (R\u2713n\nn=1 |R|2Tn)\ntime using a procedure inspired by Baum-Welch algorithm [3], as shown in Appendix B.\nLearning reward functions. 
The third term in the Q function (3) of EM is maximized using\ngradient ascent and its gradient g1(\u2713) with respect to \u2713 is derived to be\nd\u21e1\u2713(sn\nt , an\nt )\nd\u2713\n\ni+1(r\u2713i) = (1/N )PN\nt = r\u2713|sn, an, \u21e4i) (in this case, t = 0) can be computed in O(PN\n\nTnXt=1\nfor all \u2713 2{ \u27130|r\u27130 2R} . For \u21e1\u2713(sn\nt ) to be differentiable in \u2713, we de\ufb01ne the Q\u2713 function\nof MDP using an operator that blends the Q\u2713 values via Boltzmann exploration [2]: Q\u2713(s, a) ,\n\u2713>s + Ps02S t(s, a, s0) \u2326a0 Q\u2713(s0, a0) where \u2326aQ\u2713(s, a) , Pa2A Q\u2713(s, a) \u21e5 \u21e1\u2713(s, a) such\nthat \u21e1\u2713(s, a) , exp(Q\u2713(s, a))/Pa02A exp(Q\u2713(s, a0)) is de\ufb01ned as a Boltzmann exploration\npolicy, and > 0 is a temperature parameter. Then, we update \u2713i+1 \u2713i + g1(\u2713i) where  is the\nlearning step size. We use backtracking line search method to improve the performance of gradient\nascent. Similar to MLIRL, the time incurred in each iteration of gradient ascent depends mostly on\nthat of value iteration, which increases with the size of the MDP\u2019s state and action space.\nLearning transition function for switching between reward functions. To maximize the fourth\nterm in the Q function (4) of EM, if the feature measurements of 's cannot be observed by the agent\nduring the expert\u2019s demonstration (i.e., ! = ;), then we utilize the method of Lagrange multipliers\nwith the constraintsPr\u271302R \u2327!(r\u2713, s, r\u27130) = 1 for all r\u2713 2R and s 2 S to obtain\n(8)\nfor r\u2713i, r\u27130i 2R and s 2 S where S is the set of states visited by the expert, \u2327!i+1 is an estimate\nfor \u2327! 
at iteration i + 1, and n,t,r\u2713i ,s,r \u00af\u2713i , P (R\u2713n\nt = r\u00af\u2713|sn, an, \u21e4i) can be\ncomputed ef\ufb01ciently by exploiting the intermediate results from evaluating P (R\u2713n\nt = r\u2713|sn, an, \u21e4i)\ndescribed previously, as detailed in Appendix B.\nOn the other hand, if the feature measurements of 's can be observed by the agent during the expert\u2019s\ndemonstration, then recall that we use a generalized linear model to represent \u2327! (1) (Section 2) and\n! is the unknown parameter to be estimated. Similar to learning the reward weights vector \u2713 for\nreward function r\u2713, we maximize the fourth term (4) in the Q function of EM by using gradient\nascent and its gradient g2(!r\u2713r\u27130 ) with respect to !r\u2713r\u27130 is derived to be\n\nt=1 n,t,r\u2713i ,s,r\u27130i )/(Pr \u00af\u2713i2RPN\n\n\u2327!i+1(r\u2713i, s, r\u27130i) = (PN\n\nn=1PTn\n\nn=1PTn\n\nt1 = r\u2713, Sn\n\nt=1 n,t,r\u2713i ,s,r \u00af\u2713i )\n\nt = s, R\u2713n\n\n(7)\n\ng2(!r\u2713r\u27130 ) ,\n\nNXn=1\n\nTnXt=1 Xr \u00af\u27132R\n\nn,t,r\u2713,sn\n\u2327!(r\u2713, sn\n\nt ,r \u00af\u2713\nt , r\u00af\u2713)\n\nt , r\u00af\u2713)\n\nd\u2327!(r\u2713, sn\nd!r\u2713r\u27130\n\n(9)\n\nr\u2713r\u27130\n\nr\u2713r\u27130\n\n+ g2(!i\n\nr\u2713r\u27130 !i\n\nn=1 |R|2|S|Tn) time.\n\ndenote an estimate for !r\u2713r\u27130 at iteration i. Then, it is updated\nfor all !r\u2713r\u27130 2 !. Let !i\nusing !i+1\n) where  is the learning step size. Backtracking line search\nr\u2713r\u27130\nmethod is also used to improve the performance of gradient ascent here. In both cases, the time\nincurred in each iteration i is proportional to the number of n,t,r\u2713i ,s,r \u00af\u2713i to be computed, which is\n\nViterbi algorithm for partitioning a trajectory into segments with different locally consistent\n\nO(PN\nreward functions. Given the \ufb01nal estimate b\u21e4= ( b\u232b,b, {b\u2713|rb\u2713 2R} ,\u2327b!) 
for the unknown parameters $\\Lambda$ produced by EM, the most likely partition of the expert\u2019s $n$-th demonstrated trajectory into segments generated by different locally consistent reward functions is $r^*_{\\theta^n} = (r^*_{\\theta^n_t})_{t=0}^{T_n} \\triangleq \\operatorname{argmax}_{r_{\\theta^n}} P(r_{\\theta^n} | s^n, a^n, \\hat\\Lambda) = \\operatorname{argmax}_{r_{\\theta^n}} P(r_{\\theta^n}, s^n, a^n | \\hat\\Lambda)$, which can be derived using the Viterbi algorithm [20]. Specifically, define $v_{r_{\\hat\\theta}, T}$ for $T = 1, \\ldots, T_n$ as the probability of the most likely reward function sequence $(r_{\\theta^n_t})_{t=0}^{T-1}$ from time steps $0$ to $T - 1$ ending with reward function $r_{\\hat\\theta}$ at time step $T$ that produce state and action sequences $(s^n_t)_{t=1}^{T}$ and $(a^n_t)_{t=1}^{T}$:\n\n\\begin{align*}\nv_{r_{\\hat\\theta}, T} & \\triangleq \\max_{(r_{\\theta^n_t})_{t=0}^{T-1}} P((r_{\\theta^n_t})_{t=0}^{T-1}, R_{\\theta^n_T} = r_{\\hat\\theta}, (s^n_t)_{t=1}^{T}, (a^n_t)_{t=1}^{T} | \\hat\\Lambda) \\\\\n& = t(s^n_{T-1}, a^n_{T-1}, s^n_T) \\, \\pi_{\\hat\\theta}(s^n_T, a^n_T) \\max_{r_{\\hat\\theta'}} v_{r_{\\hat\\theta'}, T-1} \\, \\tau_{\\hat\\omega}(r_{\\hat\\theta'}, s^n_T, r_{\\hat\\theta}) , \\\\\nv_{r_{\\hat\\theta}, 1} & \\triangleq \\max_{r_{\\theta^n_0}} P(r_{\\theta^n_0}, R_{\\theta^n_1} = r_{\\hat\\theta}, s^n_1, a^n_1 | \\hat\\Lambda) = \\hat\\nu(s^n_1) \\, \\pi_{\\hat\\theta}(s^n_1, a^n_1) \\max_{r_{\\hat\\theta'}} \\hat\\lambda(r_{\\hat\\theta'}) \\, \\tau_{\\hat\\omega}(r_{\\hat\\theta'}, s^n_1, r_{\\hat\\theta}) .\n\\end{align*}\n\nThen, $r^*_{\\theta^n_0} = \\operatorname{argmax}_{r_{\\hat\\theta'}} \\hat\\lambda(r_{\\hat\\theta'}) \\, \\tau_{\\hat\\omega}(r_{\\hat\\theta'}, s^n_1, r^*_{\\theta^n_1})$, $r^*_{\\theta^n_T} = \\operatorname{argmax}_{r_{\\hat\\theta'}} v_{r_{\\hat\\theta'}, T} \\, \\tau_{\\hat\\omega}(r_{\\hat\\theta'}, s^n_{T+1}, r^*_{\\theta^n_{T+1}})$ for $T = 1, \\ldots, T_n - 1$, and $r^*_{\\theta^n_{T_n}} = \\operatorname{argmax}_{r_{\\hat\\theta}} v_{r_{\\hat\\theta}, T_n}$. The above Viterbi algorithm can be applied in the same way to partition an agent\u2019s trajectory traversing through any region (i.e., possibly not visited by the expert) of the state space during its execution in $O(|\\mathcal{R}|^2 T)$ time.\n\n4 Experiments and Discussion\nThis section evaluates the empirical performance of our IRL algorithm using 3 datasets featuring experts\u2019 demonstrated trajectories in two simulated grid worlds and real-world taxi trajectories. The average log-likelihood of the expert\u2019s demonstrated trajectories is used as the performance metric because it inherently accounts for the fidelity of our IRL algorithm in learning the locally consistent reward functions (i.e., $\\mathcal{R}$) and the stochastic transitions between them (i.e., $\\tau_\\omega$):\n\n$$L(\\Lambda) \\triangleq (1/N_{\\text{tot}}) \\sum_{n=1}^{N_{\\text{tot}}} \\log P(s^n, a^n | \\Lambda) \\tag{10}$$\n\nwhere $N_{\\text{tot}}$ is the total number of the expert\u2019s demonstrated trajectories available in the dataset. As proven in [17], maximizing $L(\\Lambda)$ with respect to $\\Lambda$ is equivalent to minimizing an empirical approximation of the Kullback-Leibler divergence between the distributions of the agent\u2019s and expert\u2019s generated state-action trajectories. Note that when the final estimate $\\hat\\Lambda$ produced by EM (Section 3) is plugged into (10), the resulting $P(s^n, a^n | \\hat\\Lambda)$ in (10) can be computed efficiently using a procedure similar to that in Section 3, as detailed in Appendix C. To avoid local maxima in gradient ascent, we initialize our EM algorithm with 20 random $\\Lambda^0$ values and report the best result based on the Q value of EM (Section 3).\nTo demonstrate the importance of modeling and learning stochastic transitions between locally consistent reward functions, the performance of our IRL algorithm is compared with that of its reduced variant assuming no change/switching of reward function within each trajectory, which is implemented by initializing $\\tau_\\omega(r_\\theta, s, r_\\theta) = 1$ for all $r_\\theta \\in \\mathcal{R}$ and $s \\in \\mathcal{S}$ and deactivating the learning of $\\tau_\\omega$. In fact, it can be shown (Appendix D) that such a reduction, interestingly, is equivalent to EM clustering with MLIRL [2]. So, our IRL algorithm generalizes EM clustering with MLIRL, the latter of which has been empirically demonstrated in [2] to outperform many existing IRL algorithms, as discussed in Section 1.\nSimulated grid world A. 
The environment (Fig. 3) is modeled as a 5 \u21e5 5 grid of states, each of\nwhich is either land, water, water and destination, or obstacle associated with the respective feature\nvectors (i.e., s) (0, 1, 0)>, (1, 0, 0)>, (1, 0, 1)>, and (0, 0, 0)>. The expert starts at origin (0, 2)\nand any of its actions can achieve the desired state with 0.85 probability. It has two possible reward\nfunctions, one of which prefers land to water and going to destination (i.e., \u2713 = (0, 20, 30)>), and\nthe other of which prefers water to land and going to destination (i.e., \u27130 = (20, 0, 30)>). The expert\nwill only consider switching its reward function at states (2, 0) and (2, 4) from r\u27130 to r\u2713 with 0.5\nprobability and from r\u2713 to r\u27130 with 0.7 probability; its reward function remains unchanged at all\nother states. The feature measurements of 's cannot be observed by the agent during the expert\u2019s\ndemonstration. So, ! = ; and \u2327! is estimated using (8). We set  to 0.95 and the number |R| of\nreward functions of the agent to 2.\nFig. 4a shows results of the average log-likelihood L (10) achieved by our IRL algorithm, EM\nclustering with MLIRL, and the expert averaged over 4 random instances with varying number\nN of expert\u2019s demonstrated trajectories. It can be observed that our IRL algorithm signi\ufb01cantly\noutperforms EM clustering with MLIRL and achieves a L performance close to that of the expert,\nespecially when N increases. This can be explained by its modeling of \u2327! and its high \ufb01delity in\nlearning and predicting \u2327!: While our IRL algorithm allows switching of reward function within\neach trajectory, EM clustering with MLIRL does not.\n\nFigure 3: Grid worlds A (states\n(0, 0), (1, 1), and (2, 2) are, respec-\ntively, examples of water, land, and\nobstacle), and B (state (2, 2) is an\nexample of barrier). 
\u2018O\u2019 and \u2018D\u2019 denote origin and destination.\n\nFigure 4: Graphs of average log-likelihood $L$ achieved by our IRL algorithm, EM clustering with MLIRL, and the expert vs. number $N$ of expert\u2019s demonstrated trajectories in simulated grid worlds (a) A ($N_{\\mathrm{tot}} = 1500$) and (b) B ($N_{\\mathrm{tot}} = 500$).\n\nWe also observe that the accuracy of estimating the transition probabilities $\\tau_{\\omega}(r_{\\theta}, s, \\cdot)$ ($\\tau_{\\omega}(r_{\\theta'}, s, \\cdot)$) using (8) depends on the frequency and distribution of trajectories demonstrated by the expert with its reward function $R_{\\theta^n_{t-1}} = r_{\\theta}$ ($R_{\\theta^n_{t-1}} = r_{\\theta'}$) at time step $t-1$ and its state $s^n_t = s$ at time step $t$, which is expected. Those transition probabilities that are poorly estimated due to few relevant expert\u2019s demonstrated trajectories, however, do not hurt the $L$ performance of our IRL algorithm by much because such trajectories tend to have very low probability of being demonstrated by the expert. In any case, this issue can be mitigated by using the generalized linear model (1) to represent $\\tau_{\\omega}$ and observing the feature measurements of $\\varphi_s$ necessary for learning and computing $\\tau_{\\omega}$, as shown next.\nSimulated grid world B. The environment (Fig. 
3) is also modeled as a $5 \\times 5$ grid of states, each of which is either the origin, destination, or land, associated with the respective feature vectors (i.e., $\\phi_s$) $(0, 1)^{\\top}$, $(1, 0)^{\\top}$, and $(0, 0)^{\\top}$. The expert starts at origin $(4, 0)$ and any of its actions can achieve the desired state with 0.85 probability. It has two possible reward functions, one of which prefers going to destination (i.e., $\\theta = (30, 0)^{\\top}$), and the other of which prefers returning to origin (i.e., $\\theta' = (0, 30)^{\\top}$). While moving to the destination, the expert will encounter barriers at some states with corresponding feature vectors $\\varphi_s = (1, 1)^{\\top}$ and no barriers at all other states with $\\varphi_s = (0, 1)^{\\top}$; the second component of $\\varphi_s$ is used as an offset value in the generalized linear model (1). The expert\u2019s behavior of switching between reward functions is governed by a generalized linear model $\\tau_{\\omega}$ (1) with $r_{\\tilde{\\theta}} = r_{\\theta'}$ and transition weights $\\omega_{r_{\\theta} r_{\\theta}} = (-11, 12)^{\\top}$ and $\\omega_{r_{\\theta'} r_{\\theta}} = (13, -12)^{\\top}$. As a result, it will, for example, consider switching its reward function at states with barriers from $r_{\\theta}$ to $r_{\\theta'}$ with 0.269 probability. We estimate $\\tau_{\\omega}$ using (9) and set $\\gamma$ to 0.95 and the number $|\\mathcal{R}|$ of reward functions of the agent to 2. To assess the fidelity of learning and predicting the stochastic transitions between reward functions at unvisited states, we intentionally remove all demonstrated trajectories that visit state $(2, 0)$ with a barrier.\nFig. 4b shows results of $L$ (10) performance achieved by our IRL algorithm, EM clustering with MLIRL, and the expert averaged over 4 random instances with varying $N$. It can again be observed that our IRL algorithm outperforms EM clustering with MLIRL and achieves an $L$ performance comparable to that of the expert due to its modeling of $\\tau_{\\omega}$ 
and its high fidelity in learning and predicting $\\tau_{\\omega}$: While our IRL algorithm allows switching of reward function within each trajectory, EM clustering with MLIRL does not. Besides, the estimated transition function $\\tau_{\\hat{\\omega}}$ using (9) is very close to that of the expert, even at unvisited state $(2, 0)$. So, unlike using (8), the learning of $\\tau_{\\omega}$ with (9) can be generalized well across different states, thus allowing $\\tau_{\\omega}$ to be predicted accurately at any state. Hence, we will model $\\tau_{\\omega}$ with (1) and learn it using (9) in the next experiment.\nReal-world taxi trajectories. The Comfort taxi company in Singapore has provided GPS traces of 59 taxis with the same origin and destination that are map-matched [18] onto a network (i.e., comprising highways, arterials, slip roads, etc.) of 193 road segments (i.e., states). Each road segment/state is specified by a 7-dimensional feature vector $\\phi_s$: Each of the first six components of $\\phi_s$ is an indicator describing whether it belongs to Alexandra Road (AR), Ayer Rajah Expressway (AYE), Depot Road (DR), Henderson Road (HR), Jalan Bukit Merah (JBM), or Lower Delta Road (LDR), while the last component of $\\phi_s$ is the normalized shortest path distance from the road segment to destination. We assume that the 59 map-matched trajectories are demonstrated by taxi drivers with a common set $\\mathcal{R}$ of 2 reward functions and the same transition function $\\tau_{\\omega}$ (1) for switching between reward functions, the latter of which is influenced by the normalized taxi speed constituting the first component of the 2-dimensional feature vector $\\varphi_s$; the second component of $\\varphi_s$ is used as an offset of value 1 in the generalized linear model (1). The number $|\\mathcal{R}|$ of reward functions is set to 2 because when we experiment with $|\\mathcal{R}| = 3$, two of the learned reward functions are similar. Every driver can deterministically move its taxi from its current road segment to the desired adjacent road segment.\nFig. 
5a shows results of $L$ (10) performance achieved by our IRL algorithm and EM clustering with MLIRL averaged over 3 random instances with varying $N$. Our IRL algorithm outperforms EM clustering with MLIRL due to its modeling of $\\tau_{\\omega}$ and its high fidelity in learning and predicting $\\tau_{\\omega}$.\n\nFigure 5: Graphs of (a) average log-likelihood $L$ achieved by our IRL algorithm and EM clustering with MLIRL vs. no. $N$ of taxi trajectories ($N_{\\mathrm{tot}} = 59$) and (b) transition probabilities of switching between reward functions vs. taxi speed. The curves in (b) are $\\tau_{\\hat{\\omega}}(r_{\\hat{\\theta}}, s, r_{\\hat{\\theta}'})$ and $\\tau_{\\hat{\\omega}}(r_{\\hat{\\theta}'}, s, r_{\\hat{\\theta}'})$.\n\nTo see this, our IRL algorithm is able to learn that a taxi driver is likely to switch between reward functions representing different intentions within its demonstrated trajectory: Reward function $r_{\\hat{\\theta}}$ denotes his intention of driving directly to the destination (Fig. 6a) due to a huge penalty (i.e., reward weight $-49$) on being far from destination and a large reward (i.e., reward weight 35.7) for taking the shortest path from origin to destination, which is via JBM, while $r_{\\hat{\\theta}'}$ denotes his intention of detouring to DR or JBM (Fig. 6b) due to large rewards for traveling on them (respectively, reward weights 30.5 and 23.7). As an example, Fig. 6c shows the most likely partition of a demonstrated trajectory into segments generated from locally consistent reward functions $r_{\\hat{\\theta}}$ and $r_{\\hat{\\theta}'}$, which is derived using our Viterbi algorithm (Section 3). It can be observed that the driver is initially in $r_{\\hat{\\theta}'}$ on the slip road exiting AYE, switches from $r_{\\hat{\\theta}'}$ to $r_{\\hat{\\theta}}$ upon turning into AR to detour to DR, and remains in $r_{\\hat{\\theta}}$ while driving along DR, HR, and JBM to destination. On the other hand, the reward functions learned by EM clustering with MLIRL are both associated with his intention of driving directly to destination (i.e., similar to $r_{\\hat{\\theta}}$); it is not able to learn his intention of detouring to DR or JBM.\nFig. 5b shows the influence of normalized taxi speed (i.e., first component of $\\varphi_s$) on the estimated transition function $\\tau_{\\hat{\\omega}}$ using (9). It can be observed that when the driver is in $r_{\\hat{\\theta}}$ (i.e., driving directly to destination), he is very unlikely to change his intention regardless of taxi speed. But, when he is in $r_{\\hat{\\theta}'}$ (i.e., detouring to DR or JBM), he is likely (unlikely) to remain in this intention if taxi speed is low (high). The demonstrated trajectory in Fig. 6c in fact supports this observation: The driver initially remains in $r_{\\hat{\\theta}'}$ on the upslope slip road exiting AYE, which causes the low taxi speed. Upon turning into AR to detour to DR, he switches from $r_{\\hat{\\theta}'}$ to $r_{\\hat{\\theta}}$ because he can drive at relatively high speed on flat terrain.\n\nFigure 6: Reward (a) $r_{\\hat{\\theta}}(s)$ and (b) $r_{\\hat{\\theta}'}(s)$ for each road segment $s$ with $\\hat{\\theta} = (7.4, 3.9, 16.3, 20.3, 35.7, 21.5, -49.0)^{\\top}$ and $\\hat{\\theta}' = (5.2, 9.2, 30.5, 15.0, 23.7, 21.5, 9.2)^{\\top}$ such that more red road segments give higher rewards. (c) Most likely partition of a demonstrated trajectory from origin \u2018O\u2019 to destination \u2018D\u2019 into red and green segments generated by $r_{\\hat{\\theta}}$ and $r_{\\hat{\\theta}'}$, respectively.\n\n5 Conclusion\nThis paper describes an EM-based IRL algorithm that can learn the multiple reward functions being locally consistent in different segments along a trajectory as well as the stochastic transitions between them. It generalizes EM clustering with MLIRL and has been empirically demonstrated to outperform it on both synthetic and real-world datasets.\nFor our future work, we plan to extend our IRL algorithm to cater to an unknown number of reward functions [6], nonlinear reward functions [12] modeled by Gaussian processes [4, 8, 13, 14, 15, 25], other dissimilarity measures described in Section 1, linearly-solvable MDPs [7], active learning with Gaussian processes [11], and interactions with self-interested agents [9, 10].\nAcknowledgments. This work was partially supported by Singapore-MIT Alliance for Research and Technology Subaward Agreement No. 52 R-252-000-550-592.\n\nReferences\n[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proc. ICML, 2004.\n[2] M. Babe\u015f-Vroman, V. Marivate, K. Subramanian, and M. Littman. Apprenticeship learning about multiple intentions. In Proc. ICML, pages 897\u2013904, 2011.\n[3] J. Bilmes. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and Hidden Markov models. 
Technical Report ICSI-TR-97-02, University of California, Berkeley, 1998.\n[4] J. Chen, N. Cao, K. H. Low, R. Ouyang, C. K.-Y. Tan, and P. Jaillet. Parallel Gaussian process regression with low-rank covariance matrix approximations. In Proc. UAI, pages 152\u2013161, 2013.\n[5] J. Choi and K. Kim. Inverse reinforcement learning in partially observable environments. JMLR, 12:691\u2013730, 2011.\n[6] J. Choi and K. Kim. Nonparametric Bayesian inverse reinforcement learning for multiple reward functions. In Proc. NIPS, pages 314\u2013322, 2012.\n[7] K. Dvijotham and E. Todorov. Inverse optimal control with linearly-solvable MDPs. In Proc. ICML, pages 335\u2013342, 2010.\n[8] T. N. Hoang, Q. M. Hoang, and K. H. Low. A unifying framework of anytime sparse Gaussian process regression models with stochastic variational inference for big data. In Proc. ICML, pages 569\u2013578, 2015.\n[9] T. N. Hoang and K. H. Low. A general framework for interacting Bayes-optimally with self-interested agents using arbitrary parametric model and model prior. In Proc. IJCAI, pages 1394\u20131400, 2013.\n[10] T. N. Hoang and K. H. Low. Interactive POMDP Lite: Towards practical planning to predict and exploit intentions for interacting with self-interested agents. In Proc. IJCAI, pages 2298\u20132305, 2013.\n[11] T. N. Hoang, K. H. Low, P. Jaillet, and M. Kankanhalli. Nonmyopic $\\epsilon$-Bayes-optimal active learning of Gaussian processes. In Proc. ICML, pages 739\u2013747, 2014.\n[12] S. Levine, Z. Popovi\u0107, and V. Koltun. Nonlinear inverse reinforcement learning with Gaussian processes. In Proc. NIPS, pages 19\u201327, 2011.\n[13] K. H. Low, J. Chen, T. N. Hoang, N. Xu, and P. Jaillet. Recent advances in scaling up Gaussian process predictive models for large spatiotemporal data. In S. Ravela and A. Sandu, editors, Proc. Dynamic Data-driven Environmental Systems Science Conference (DyDESS\u201914). 
LNCS 8964, Springer, 2015.\n[14] K. H. Low, N. Xu, J. Chen, K. K. Lim, and E. B. \u00d6zg\u00fcl. Generalized online sparse Gaussian processes with application to persistent mobile robot localization. In Proc. ECML/PKDD Nectar Track, 2014.\n[15] K. H. Low, J. Yu, J. Chen, and P. Jaillet. Parallel Gaussian process regression for big data: Low-rank representation meets Markov approximation. In Proc. AAAI, pages 2821\u20132827, 2015.\n[16] G. Neu and C. Szepesv\u00e1ri. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proc. UAI, pages 295\u2013302, 2007.\n[17] G. Neu and C. Szepesv\u00e1ri. Training parsers by inverse reinforcement learning. Machine Learning, 77(2\u20133):303\u2013337, 2009.\n[18] P. Newson and J. Krumm. Hidden Markov map matching through noise and sparseness. In Proc. 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 336\u2013343, 2009.\n[19] A. Y. Ng and S. Russell. Algorithms for inverse reinforcement learning. In Proc. ICML, 2000.\n[20] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77(2):257\u2013286, 1989.\n[21] D. Ramachandran and E. Amir. Bayesian inverse reinforcement learning. In Proc. IJCAI, pages 2586\u20132591, 2007.\n[22] S. Russell. Learning agents for uncertain environments. In Proc. COLT, pages 101\u2013103, 1998.\n[23] U. Syed, M. Bowling, and R. E. Schapire. Apprenticeship learning using linear programming. In Proc. ICML, pages 1032\u20131039, 2008.\n[24] U. Syed and R. E. Schapire. A game-theoretic approach to apprenticeship learning. In Proc. NIPS, pages 1449\u20131456, 2007.\n[25] N. Xu, K. H. Low, J. Chen, K. K. Lim, and E. B. \u00d6zg\u00fcl. GP-Localize: Persistent mobile robot localization using online sparse Gaussian process observation model. In Proc. AAAI, pages 2585\u20132592, 2014.\n[26] J. Yu, K. H. Low, A. 
Oran, and P. Jaillet. Hierarchical Bayesian nonparametric approach to modeling and learning the wisdom of crowds of urban traffic route planning agents. In Proc. IAT, pages 478\u2013485, 2012.\n[27] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In Proc. AAAI, pages 1433\u20131438, 2008.\n", "award": [], "sourceid": 1051, "authors": [{"given_name": "Quoc Phong", "family_name": "Nguyen", "institution": "National University of Singapore"}, {"given_name": "Bryan Kian Hsiang", "family_name": "Low", "institution": "National University of Singapore"}, {"given_name": "Patrick", "family_name": "Jaillet", "institution": "Massachusetts Institute of Technology"}]}