NeurIPS 2020

Deep Inverse Q-learning with Constraints

Review 1

Summary and Contributions: UPDATE: I thank the authors for their feedback. It seems to me that the authors have resolved most of the mentioned issues. Among them, they explain the possible confusion about the Boltzmann distribution, the most common criticism among the other reviewers. Ultimately, the decision will depend on the other reviewers' stance. The paper presents a new approach to Inverse Reinforcement Learning (IRL) in four algorithms: tabular, model-based IAVI; tabular, model-free IQL; deep, model-free DIQL; and finally DCIQL with constraints. The main benefit of the presented algorithms is their order-of-magnitude faster convergence to better results. The authors present definitions of the algorithms and benchmark them in two domains, ObjectWorld and autonomous driving in SUMO, comparing against the related algorithm (Deep) MaxEnt. This is a very clear paper with strong results and high significance, hence I strongly advocate its acceptance.

Strengths: The paper is extremely well written, with an elaborate comparison to related work. It reads with ease, mainly due to the ordering of the presented concepts. The first presented algorithm results in a system of linear equations that is solvable in closed form for some domains or numerically (e.g., for infinite MDPs with no terminating states). The subsequent algorithms gradually build on top of each other, making their definitions very clear. The algorithms avoid the inner loop of the prior algorithm MaxEnt, and as such are an order of magnitude more efficient. The first experiment provides evidence for the speed-up, and also for stable convergence to a near-precise solution (even in the case of the deep DIQL). The second experiment showcases the possibility of incorporating constraints with unconstrained expert demonstrations. Mainly due to the speed-up and convergence properties, the presented algorithms are of high interest to the community. I also appreciated the included video briefly summarizing the paper.

Weaknesses: I found none.

Correctness: Seems correct.

Clarity: Very clear.

Relation to Prior Work: Yes, thoroughly.

Reproducibility: Yes

Additional Feedback: The supplementary material does not contain the source code. For reproducibility, I would appreciate it if the source code were published later.

Review 2

Summary and Contributions: [UPDATE] I thank the authors for their response. I agree that the empirical results (for the settings considered in the main paper and the additional results provided in the rebuttal) are convincing/impressive. But on the theoretical/algorithmic front, I'm still not convinced. Especially:

[1] lines 4-16 in the rebuttal: I still think that Theorem 1 imposes a strong restriction on the class of MDPs (even if it relaxes the restriction on the expert policy distribution): not all MDPs necessarily satisfy such a condition over the long-term Q-values. Consider an example: an action space that contains only two actions A = {a, b}, a state s, and a greedy/deterministic expert policy such that \pi(s) = a; then the condition in Theorem 1 is Q^*(s,a) = Q^*(s,b) + \infty. I don't know how to interpret this condition or how it would affect the subsequent analysis in Section 3. I believe this needs more discussion to flesh out the broader applicability of their claims.

[2] lines 26-35 in the rebuttal: this explanation is only partly clear to me. It would be better to have explicit formal statements/theorems on convergence guarantees for Sections 3 and 4 (as I believe these two sections are the main contributions of this paper; Sections 5 and 6 are value-added extensions based on standard tricks from the literature).

Based on the above two points, I expect that it would require a substantial amount of work to revise and would warrant another round of discussion/reviews.

=============================

This paper aims to address a fundamental computational problem in standard inverse reinforcement learning algorithms (e.g., MaxEnt): exact computation of the optimal policy w.r.t. the current reward parameters at every iteration. To this end, this paper proposes a new algorithm, named inverse action-value iteration (IAVI), that has to solve a system of equations only once for each state. But their proposal depends on two assumptions: (a) the expert policy follows a certain probabilistic form, and (b) the MDP should have terminal states and an ordering property. They have also provided a sampling-based variant of this algorithm: inverse Q-learning (IQL). They then extended this algorithm to the high-dimensional setting using standard tricks from DQN (e.g., replay buffer, target network, etc.). Finally, they have also extended the approach to the constrained MDP setting.
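The reviewer's point [1] can be made concrete with a small numerical sketch (the Q-values below are hypothetical, and this is only an illustration of the Boltzmann form the paper assumes, not the authors' implementation): under a Boltzmann expert policy, the log-ratio of action probabilities recovers the Q-value gap exactly, but a greedy/deterministic expert assigns zero probability to some action, so the implied gap diverges.

```python
import math

# Hypothetical optimal Q-values for two actions {a, b} in a single state s
Q = {"a": 2.0, "b": 1.0}

# Boltzmann expert policy of the form assumed by the paper:
#   pi(u|s) = exp(Q(s,u)) / sum_v exp(Q(s,v))
Z = sum(math.exp(q) for q in Q.values())
pi = {u: math.exp(q) / Z for u, q in Q.items()}

# The log-ratio of action probabilities recovers the Q-value gap exactly
gap = math.log(pi["a"]) - math.log(pi["b"])
assert abs(gap - (Q["a"] - Q["b"])) < 1e-12

# A greedy/deterministic expert (pi(a)=1, pi(b)=0) breaks this:
# log(0) is undefined, so the implied gap Q*(s,a) - Q*(s,b) is unbounded,
# which is the "Q^*(s,a) = Q^*(s,b) + inf" condition the reviewer questions.
pi_greedy = {"a": 1.0, "b": 0.0}
try:
    math.log(pi_greedy["b"])
    implied_gap = gap
except ValueError:  # math domain error on log(0)
    implied_gap = float("inf")
print(implied_gap)  # inf
```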

Strengths: The paper proposes a novel algorithm for the IRL problem with the potential to reduce the number of exact policy evaluation steps. The paper is fairly well written, and illustrative experimental results are provided. Reproducibility: Hyperparameter details are clearly documented in Appendix C and D.

Weaknesses: I consider the following as the limitations of/concerns about this work:

1/ What would happen to your strategy if the expert demonstrations are not from a Boltzmann policy? Say the expert policy is the soft-optimal policy as in MaxEnt; then the only change in Eq (1) is that, in place of the optimal Q^*, you would use the soft-optimal Q^soft (which satisfies the soft Bellman equation). How would this affect your strategy in Section 3?

2/ At the end of Section 3, you state that "In case of an infinite control problem or if no clear reverse topological order exists, we solve the MDP by iterating multiple times until convergence." Is this procedure guaranteed to converge and reach policy matching?

3/ Is the tabular inverse Q-learning algorithm guaranteed to converge? What theoretical guarantees can you provide for this algorithm?

4/ Regarding Algorithm 1, line 5: if you computed the state-action visitation frequencies (by traversing through the expert data) beforehand, wouldn't it improve the accuracy of the expert policy estimate in line 6?

5/ In the experiments (Section 7.1), are the expert demonstrations generated from (a) the optimal policy w.r.t. the true reward, or (b) the Boltzmann policy as given in Eq (1)? If (b) is the case, the comparison of MaxEnt with IQL (your algorithm) may not be fair, because MaxEnt IRL implicitly assumes that the expert policy is optimal w.r.t. the unknown true reward (and you are also evaluating the final performance w.r.t. this reward). Thus, to satisfy the assumption required for IQL, you are violating the requirements for MaxEnt. Have you empirically tested the performance of IQL when the required assumption is not satisfied, i.e., when demonstrations are generated according to (a)?
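Point 4 can be illustrated with a minimal sketch (the demonstration data here is made up; this is not the authors' Algorithm 1, only the count-based estimator the reviewer suggests): traversing the expert data once beforehand and normalizing state-action visitation counts per state yields an empirical estimate of the expert policy.

```python
from collections import Counter, defaultdict

# Hypothetical expert demonstrations as (state, action) pairs
demos = [("s0", "a"), ("s0", "a"), ("s0", "b"),
         ("s1", "b"), ("s1", "b"), ("s1", "b")]

# Build state-action visitation counts in one pass over the data
counts = defaultdict(Counter)
for s, a in demos:
    counts[s][a] += 1

# Normalize per state to obtain the count-based expert policy estimate
pi_hat = {
    s: {a: n / sum(c.values()) for a, n in c.items()}
    for s, c in counts.items()
}
print(pi_hat["s0"])  # {'a': 0.666..., 'b': 0.333...}
```

With all counts available up front, each estimate uses the full dataset at once, rather than an online running estimate updated sample by sample.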

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: update: rebuttal read. Overall, a good paper, but it could be clarified in several places (see review). --- The authors reduce the IRL problem (with rewards + constraints) to a recursive linear problem over r(s, a) for rollout data collected from a Boltzmann distribution over the optimal Q-value function. Assuming terminal rewards are known, in a special case, the non-terminal rewards r(s, a) can be inferred. The authors formulate their approach in both the tabular and function-approximation settings. They show, mostly in the latter setting, that their method gets closer to the true r(s, a) / violates fewer constraints than baselines.

Strengths: The paper is written clearly and the diversity in experiments is nice.

Weaknesses: My main concern is how feasible this approach is in general. What if the data didn't come from the optimal (Boltzmann distribution) policy? What if there is a significant (suboptimal) bias in the policy? What if the optimal policy is multimodal (e.g., going left or right doesn't make a difference)? How robust is this method to discretizing continuous state-action spaces?

Correctness: Yes, as far as I can tell.

Clarity: Yes.

Relation to Prior Work: I'm not very familiar with all the recent IRL work, so I cannot judge this definitively.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: Update: The rebuttal clearly addressed my concern about the restriction of the Boltzmann distribution assumption. However, R2 raised a new question [1]. I think the authors need a more careful discussion of the proof of Theorem 1 in their rebuttal, i.e., what if \pi(a) is zero for some actions, so that log(\pi(a)) is not defined. I strongly recommend the authors review their proof with respect to this issue and add the discussion they had in the rebuttal to the main paper. I'd like to raise my rating to accept after reading the rebuttal. =================================================== The paper proposes an efficient IRL method that does not require a costly inner loop for estimating value functions, by exploiting the assumption that expert demonstrations follow a Boltzmann distribution over an underlying reward/value function. Experiments on simulators demonstrate its effectiveness compared to popular IRL baselines.

Strengths: An interesting solution backed up with mathematical derivations. The authors also discuss different situations/assumptions, such as constrained IRL, which I think provides insights to the community.

Weaknesses: I'm interested in the assumption that the expert demonstrations are generated under a Boltzmann distribution. Although this covers a broad range of problems, there are many other problems that may violate this assumption. For example, the expert may simply choose the max-reward action rather than sample actions from a distribution. I'm curious, for the latter case, do the authors think it is doable to adapt their method to this type of problem?

Correctness: Yes.

Clarity: I think in general the math part is a little tough to read. I hope the authors can make it more lightweight and provide more intuition/transitions where possible.

Relation to Prior Work: Clear to me

Reproducibility: Yes

Additional Feedback: