NeurIPS 2020

What Did You Think Would Happen? Explaining Agent Behaviour through Intended Outcomes

Review 1

Summary and Contributions: The paper proposes an explanation method for reinforcement learning, in which the explanation consists of the series of actions the agent intends to take from a given state-action pair (they call these belief maps). The authors show that this type of explanation cannot be obtained from Q-functions alone, and that it requires keeping track of additional information during Q-learning. Their proposal consists of also keeping track of the expected state and reward visitation matrices, which can substantially add to the memory and computational resources needed for Q-learning.
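
The bookkeeping described in this summary can be illustrated with a minimal tabular sketch. It assumes a deterministic toy chain environment and an update in which the belief map H is bootstrapped from the greedy next action, mirroring the Q-learning update. The function name, the environment, and the hyperparameters are hypothetical and are not taken from the paper. Note that H stores one |S|x|A| map per state-action pair, which is where the extra memory cost comes from.

```python
import numpy as np

def q_learning_with_belief_map(transitions, rewards, terminal,
                               episodes=200, alpha=0.5, gamma=0.9,
                               epsilon=0.2, seed=0):
    """Tabular Q-learning that also tracks a belief map H (illustrative only).

    transitions[s, a] -> next state (deterministic toy environment, an assumption)
    rewards[s, a]     -> deterministic reward
    terminal[s]       -> True if reaching s ends the episode
    """
    n_states, n_actions = rewards.shape
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    # H[s, a] is an |S| x |A| map of expected discounted future visits.
    H = np.zeros((n_states, n_actions, n_states, n_actions))
    for _ in range(episodes):
        s = 0
        while not terminal[s]:
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next = int(transitions[s, a])
            visit = np.zeros((n_states, n_actions))
            visit[s, a] = 1.0
            if terminal[s_next]:
                q_target, h_target = rewards[s, a], visit
            else:
                # Both targets bootstrap from the same greedy next action.
                a_star = int(np.argmax(Q[s_next]))
                q_target = rewards[s, a] + gamma * Q[s_next, a_star]
                h_target = visit + gamma * H[s_next, a_star]
            Q[s, a] += alpha * (q_target - Q[s, a])
            H[s, a] += alpha * (h_target - H[s, a])
            s = s_next
    return Q, H

# Hypothetical 4-state chain: action 0 moves left, action 1 moves right,
# state 3 is terminal, and stepping from state 2 into state 3 pays reward 1.
transitions = np.array([[0, 1], [0, 2], [1, 3], [3, 3]])
rewards = np.zeros((4, 2))
rewards[2, 1] = 1.0
terminal = np.array([False, False, False, True])

Q, H = q_learning_with_belief_map(transitions, rewards, terminal)
# Value implied by the belief map: sum of visited rewards under H.
Q_from_H = np.einsum("sauv,uv->sa", H, rewards)
```

In this deterministic sketch, with Q and H initialised to zero and updated with the same step size, `Q_from_H` matches `Q`, which is the kind of consistency between explanation and Q-function that the reviews refer to.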

Strengths: Given the increasing use of RL in real-world applications that affect people, providing usable explanations for RL is indeed an important and timely research problem. I appreciate the authors' focus on the concept of intention, and the (brief) connection they make to the social psychology literature. As I explain below, the theoretical claims are sound and empirical illustrations are helpful in conveying the method.

Weaknesses: The authors briefly mention several real-world applications that can motivate their work (e.g., autonomous driving, medical robots). I wish they had mentioned these motivating cases earlier on (in the introduction) and in greater detail. Also, I believe it is important for the authors to clarify early on whether they focus on local or global explanations. Most importantly, who is the target audience of the proposed explanations? Is it RL practitioners or non-technical users of the RL model? Either way, I believe the authors should provide evidence regarding the usefulness of their proposed method for the target audience. As authors acknowledge, the proposed method is only applicable to settings where |S|x|A| is not huge. While authors outline in their concluding remarks how their approach can be utilized in settings with large state-action spaces, an empirical illustration of whether the proposal works in such settings, and what can potentially go wrong would strengthen the work.

Correctness: Authors provide easy-to-follow proofs for their main results---which appear correct to me. They provide several empirical illustrations of their approach on standard RL datasets (OpenAI Gym, Blackjack, Cartpole, and Taxi), but they don't compare their approach with prior proposals, such as [Juozapaitis et al. 2019] in terms of their practical usability for the target audience.

Clarity: The paper is well-written and easy to follow. There are, however, several minor modifications the authors can make to improve accessibility to a broader audience.
* Specify what acronyms such as DQN, PIRL stand for.
* In equation (1), the notation \pi^*(a|s) is confusing. \pi^* is supposed to be a function from S to A.
* In equation (5), \theta^- has not been defined.
* In section 4, contrastive explanations paragraph: authors seem to have a specific interpretation of contrastive explanations in RL (i.e., contrasting the intentions of two actions). The term "contrastive explanation" often refers to a broad range of explanation methods. If authors mean to say that their method can be thought of as an *instance* of this class, I'd suggest they clarify this.

Relation to Prior Work: Authors clearly explain the difference between their approach and previous RL explanation methods in their related work section, but they don't provide an empirical comparison.

Reproducibility: No

Additional Feedback: The code to reproduce the results was missing, although authors claim they provide the code on their webpage.

---- Update after reading authors' feedback ----
Authors have responded to most of my concerns to a satisfactory degree and therefore, I raise my score. In particular, I think the example provided in the response can simultaneously clarify several issues: how does the explanation method compare with prior ones, how can the explanations be useful in practice, and what would the explanation look like concretely. I urge the authors to include this example in the main body of the paper.

Review 2

Summary and Contributions: Update: Thanks for your response. I actually really liked the idea in this paper, and I appreciate your inclusion of figure 1 in your response, but I have 2 main issues that I think would take a substantial re-writing and more experiments to address. 1) I really didn't get from the writing that the mismatch between the agent's (implicit) model and the true environment is a major part of the paper. (Without this piece, I don't know what your method provides over forward simulating the agent's behavior, which is a pretty easy thing to do.) 2) I think really demonstrating that this is an important issue requires some more experiments. Figure 1 shows that cases can be engineered where the agent's internal model does differ from what would happen in the actual environment and that your explanation can highlight that, but I don't think it's enough to show that it's an issue in realistic scenarios. I agree with you that it probably is, but it's hard to know without more experiments.

---

This paper presents an approach to explaining reinforcement learning agents through the agent's intended consequences. Through this, they aim to answer the question of what the agent thought would happen in the future when deciding on a particular action. They provide a proof demonstrating that an agent's belief map, i.e. the discounted states it expects to visit in the future, cannot be uniquely determined from the Q function. They then provide a procedure for learning the agent's belief map alongside the Q function in a way that is consistent with the Q function.

Strengths: Explaining reinforcement learning agents is an important problem, and I think explaining them in terms of what future events the agent believes will happen is a good idea! I also really liked the idea that if the post-hoc explanation is underspecified, perhaps the explanation needs to be learned alongside the agent to fully capture the nuances of its behavior.

Weaknesses: One of the key ideas wasn't clear to me: how does this kind of explanation compare to forward simulating under the agent's learned policy and recording which states are visited? Is the hypothesis that the agent's "mental model" of the true transition function is flawed and that knowing this is useful information? (In the batch setting, you may not have the transition function, but you could build a model of it based on your data. How much would this approximation affect the resulting explanation?) I could imagine there being important subtleties between these 2 approaches, but I would have found it helpful to have those hypotheses laid out clearly in the introduction, and tested in the experiments.

There were a few things I thought were missing from the related work. The first is this paper: that tackles a somewhat similar problem. How does your approach compare to this? I was also curious how these belief maps compare to the successor representation: and the work building on it.

Finally, I have several comments and questions about the experiments. As I mentioned above, I would like to see how this approach differs from inspecting simulations of the agent's behavior from a particular starting state + action. Generally, I would have also found it helpful to have more explicit conclusions drawn from the results. I also have some more minor questions.
- In the blackjack setting, why doesn't the dealer's hand change? Is it just fixed and if so, why? I also wasn't sure how to interpret Figures 2c and 2d.
- In the cart pole domain, what does it entail that the DQN estimates are fuzzier? Can we interpret anything from these beyond that the agent is doing something reasonable? It would be helpful to have examples where the belief maps show something surprising about the agent's behavior or something it would have been hard to identify without the belief map.
- Finally, in the Taxi domain, I don't see the bias that the DQN agent exhibits in the figure. What should I be looking for? I think figure 6 may have a typo--descriptions for columns 3 and 4 look to be the same. I would also find it helpful to have an English description of what these columns mean. I also didn't understand why the figure in row b, 4th column has the trajectory that both agents visit highlighted. I would have expected it to have nothing highlighted since there are no squares visited in the column 1 figure but not the column 2 figure.
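
The forward-simulation baseline raised at the start of this section can be sketched as follows. This is a minimal illustration under stated assumptions: `step` is a hypothetical simulator interface, `policy` stands in for the learned greedy policy, and the chain environment is a made-up toy example rather than one of the paper's domains. The estimate uses the true simulator, so any gap between it and a learned belief map would reflect the mismatch between the agent's implicit model and the environment.

```python
import numpy as np

def rollout_visitation(step, policy, s0, a0, n_states, n_actions,
                       gamma=0.9, horizon=100, n_rollouts=100, seed=0):
    """Monte Carlo estimate of discounted (state, action) visits after taking a0 in s0.

    step(s, a, rng) -> (next_state, done)  hypothetical simulator interface
    policy(s)       -> action chosen by the learned agent
    """
    rng = np.random.default_rng(seed)
    visits = np.zeros((n_states, n_actions))
    for _ in range(n_rollouts):
        s, a, discount = s0, a0, 1.0
        for _ in range(horizon):
            visits[s, a] += discount
            s, done = step(s, a, rng)
            if done:
                break
            a = policy(s)
            discount *= gamma
    return visits / n_rollouts

# Hypothetical 4-state chain: action 1 moves right, action 0 moves left,
# and the episode ends on reaching state 3.
def chain_step(s, a, rng):
    s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s_next, s_next == 3

def always_right(s):
    return 1

visits = rollout_visitation(chain_step, always_right, s0=0, a0=1,
                            n_states=4, n_actions=2)
```

Because the toy environment and policy are deterministic, every rollout follows the same path and the estimate has no sampling noise; in a stochastic environment the number of rollouts would control the variance.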

Correctness: For the proof of theorem 1, I didn't see how this would imply that it is never possible to produce a post-hoc interpretation given multiple optimal policies. I read this as just giving one counterexample to demonstrate that this is not always possible. Did I miss something here? Even if my interpretation is correct, I think this would still cast doubt on the usefulness of a post-hoc explanation of this kind.
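
The weaker reading, that a single instance suffices to show the belief map is not always recoverable, can be made concrete with a small sketch. The two toy environments below are hypothetical and are not the construction used in the paper's proof. They share an identical Q table but differ in which states are visited, so Q alone cannot distinguish them. This shows non-uniqueness in one instance only and does not settle whether recovery is ever possible in other cases.

```python
import numpy as np

GAMMA = 0.9

def solve(next_state):
    """Return (Q, discounted state visits from state 0) for a single-action toy MDP.

    next_state[s] is the deterministic successor of s; state 3 is terminal.
    Leaving state 1 or state 2 pays reward 1; leaving state 0 pays 0.
    """
    reward = np.array([0.0, 1.0, 1.0, 0.0])
    Q = np.zeros(4)
    for _ in range(50):  # far more sweeps than this acyclic problem needs
        Q = np.array([0.0 if s == 3 else reward[s] + GAMMA * Q[next_state[s]]
                      for s in range(4)])
    # Exact discounted visitation along the single deterministic path.
    visits = np.zeros(4)
    s, discount = 0, 1.0
    while s != 3:
        visits[s] += discount
        s, discount = next_state[s], discount * GAMMA
    return Q, visits

Q_a, visits_a = solve([1, 3, 3, 3])  # environment A: state 0 leads to state 1
Q_b, visits_b = solve([2, 3, 3, 3])  # environment B: state 0 leads to state 2
```

Here `Q_a` and `Q_b` coincide, while `visits_a` puts its discounted mass on state 1 and `visits_b` puts it on state 2.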

Clarity: I found the main idea of the paper confusing and I think some re-writing to clarify the points I mentioned in the weaknesses would make it easier to understand. Other than that, I found the structure of most of the paper relatively easy to follow. In the results, I would have liked to see more details about how to read and interpret the results, as well as more explicit conclusions.

Relation to Prior Work: There are a few things I would have found it helpful to see compared/contrasted. I mentioned these in weaknesses.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This paper presents an approach to help explain the 'intention' of an RL agent by collecting additional information during training. Equipped with this new information, one can form a trajectory of belief states that capture, and perhaps reveal, agent intention. The paper provides a theorem to show that the explanation is consistent with the Q(s,a) estimate. Overall, the approach might allow further introspection and debugging of agent behavior.

Strengths: This paper is well motivated and tackles an underexplored domain in interpretability. Most recent work in the interpretability area has focused on classifiers or perhaps generative models. Here the authors provide a new insight and approach for interpretability of RL agents. The first key strength of this paper is an interesting and different attack on an underexplored problem. The paper also provides comprehensive assessment across different environments like pong, blackjack, and taxi. Here the belief trajectory seems to show interesting behavior for the agents. The 'intentionality' principle that motivates this work is quite interesting and hasn't been previously explored in the literature to my knowledge. Theorem 1, though straightforward, might have quite profound implications for how RL agents ought to be interpreted. The 'explanation' function 'H' defined in this work is in some sense analogous to a Q function, so H is updated in a manner identical to how a Q function is updated.

Weaknesses: While overall an interesting submission, I am still quite tentative as to what to take away from the empirical interpretations. My qualms concern how to falsify or confirm several of the insights presented in that section. For example, can these insights be used to alter the behavior of an agent? If yes, then such intervention might show that the insights discussed in these sections reveal true behavior. I go into more detail on this issue in the additional feedback section.

Correctness: The theorems presented are correct (as far as I can tell). The approach presented is also correct since the authors show consistency for the H function. However, it is unclear how to verify the insights discussed in the empirical section.

Clarity: The paper is clear and well written. It is free of typos and grammar problems.

Relation to Prior Work: The sum total of work in this area is relatively small. However, this paper does a good job of providing context on recent work on explainability in RL and how the approach presented here differs from these previous works. In general, I think the paper shows familiarity with the necessary work in this area.

Reproducibility: Yes

Additional Feedback: In this section I will go into detail about the perceived weakness that I mentioned above.

Validating Empirical Insights. One issue that plagues work on interpretability is the problem of validating agent/model insights. The belief map is interesting, but I am hoping the authors can provide further evidence that these belief maps indeed demonstrate agent 'intent'. Is it possible to design an agent whose intent is known a priori so that you can then compare the observed behavior to the maps that your method produces? I am not sure how this would work since the additional information for your method is also collected during training. Another way would be to use the insight from the maps to somehow change the agent behavior after observing the belief maps. This could show evidence that the maps reflect agent behavior. The upshot here is that I am hoping to understand or get at a way to sanity check the insights that we learn from the maps presented.

More discussion on motivation. It would be useful for the authors to discuss why the formulation presented is desired beyond other kinds. In addition, it would be useful to answer the question, what kind of interpretation can this method not provide? I am trying to get at the limits of the work here. Perhaps a specific question about what I am hoping for: why is p(s_{t+n} | s_t, a_t, \pi) useful for understanding the agent intent as opposed to, say, p(a_{t+n} | s_t, a_t, \pi) (assuming this is even a computable quantity)?

Minor Comment. In paragraph 2, the authors say that there are two groups of interpretation methods; this is not the case. Even if we restrict attention to the case of deep networks it is still not the case. There are 1) attribution methods (i.e. saliency maps etc), 2) exemplar/input ranking methods like the work on influence functions etc, 3) concept methods like TCAV, and 4) methods that design the model class to be interpretable by design. This list is not exhaustive, but the discussion there should be amended or qualified based on the interpretation that the authors are going for.

line 114: "to predict to future" -> "to predict the future"

Post-rebuttal ------------------------------
Thanks to the authors for the responses and clarifying my questions. I implore the authors to respond to the issues raised by R2 on more clearly comparing their setup to simulating the agent under the learned policy.

Review 4

Summary and Contributions: The authors provide approaches for explainable RL, where information needed for explanations is collected during training. They demonstrate their approach on different RL problems. They propose a decomposition of the Q-function over state and action space, giving detailed reasons of sub-rewards over future states. This is an addition to standard RL frameworks.

Strengths: This paper attempts to address the issue of black-box models by enabling explainable RL. While current explanations show what in the environment drives agents to take action, the authors aim to show what the agent expects to achieve as a result of an action choice (intent-based explanations).

Weaknesses: None; however, I am not an expert in RL.

Correctness: The methods seem sensible and the experiments chosen make sense and are descriptive of the approach.

Clarity: The paper is well-written and clear.

Relation to Prior Work: The authors have satisfactorily covered current methods of interpretability for RL models, and pointed out where their work fits in.

Reproducibility: Yes

Additional Feedback: