__ Summary and Contributions__: The paper considers the problem of risk-averse optimization with
respect to a distribution of reward functions. The authors take
advantage of the dual of the LP representation of value iteration to
define a risk-averse policy optimization method for discrete
MDPs. They show how to use this to perform risk-averse optimization
and imitation learning in example MDPs.
Contributions: 1) a linear programming formulation that computes the
optimal policy for CVaR; 2) a demonstration of how to use this
formulation to implement robust policy optimization under a prior and
robust imitation learning; 3) favorable empirical comparisons with
existing risk-sensitive and risk-neutral algorithms in both
settings. Right now, I think the description of the contributions
hides the most useful one.
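To make the dual-LP idea concrete for readers, here is a minimal sketch (not the authors' implementation; the toy MDP, rewards, and discount factor are all invented for illustration) of optimizing state-action occupancy variables subject to flow constraints and reading off a deterministic policy:

```python
import numpy as np
from scipy.optimize import linprog

S, A, gamma = 2, 2, 0.9
P = np.zeros((S, A, S))                 # P[s, a, s'] = transition probability
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.8, 0.2]; P[1, 1] = [0.1, 0.9]
R = np.array([[1.0, 0.0],               # R[s, a] = reward (made up)
              [0.0, 2.0]])
p0 = np.array([0.5, 0.5])               # initial state distribution

# Flow constraints: sum_a u(s,a) - gamma * sum_{s',a} P[s',a,s] u(s',a) = p0[s]
A_eq = np.zeros((S, S * A))
for s in range(S):
    for sp in range(S):
        for a in range(A):
            A_eq[s, sp * A + a] = float(s == sp) - gamma * P[sp, a, s]

# Maximize expected discounted reward sum_{s,a} u(s,a) * R(s,a)
# (linprog minimizes, so negate the objective).
res = linprog(c=-R.flatten(), A_eq=A_eq, b_eq=p0, bounds=(0, None))
u = res.x.reshape(S, A)                 # optimal state-action occupancies
policy = u.argmax(axis=1)               # a vertex solution is deterministic
```

The occupancies sum to 1/(1-gamma), and the optimal policy is read directly from where the occupancy mass concentrates in each state.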

__ Strengths__: The paper considers an important problem: risk-averse optimization of
policies under a distribution of reward functions. The paper provides
a mathematically elegant way to incorporate risk-aversion into a
standard and moderately scalable MDP solution technique. To my
knowledge, the proposed approach is novel, easy to implement, and will
improve the state-of-the-art on an important problem for the NeurIPS community.

__ Weaknesses__: While I believe that the proposed approach does scale well, I think
the evaluation from the authors is somewhat incomplete. It would be
nice to see this deployed in larger problems. The authors mention that
they are somewhat limited by the size of problems that they can run
inference in --- I think this is not true. It is possible to test this
algorithm as a planning method in large problems, either with a
hand-coded prior distribution over rewards or by transferring a
learned reward function to a new environment (this is where this
method should do quite well). As it stands, the reader of this paper
is left to guess at the scalability of this approach based on general
knowledge about LP solvers.

__ Correctness__: I believe the claims and the method are correct. I think that the
empirical methodology is sound, but somewhat weak. I would like to see
the authors demonstrate that this method works for substantially
larger planning problems, even if only on larger problems with
hand-specified distributions over reward functions.
I have one concern, which is that the robust implementation of the
method (4.d) doesn't match the demonstration for one state. I think
this is because the planner is actually indifferent between two
actions based on the inferred reward, but it would be good to see some
discussion of this.

__ Clarity__: Overall, I think the paper is fairly well written. I understand that
the authors are working within the page restrictions of the conference.
With that said, I think there is substantial room for improvement in
the paper presentation. First, I think there are more specific ways to
describe the contributions (copied from summary): 1) a linear programming formulation to
compute the optimal policy for CVaR; 2) show how to use this to
implement robust policy optimization under a prior and robust
imitation learning; 3) demonstrate favorable comparisons with existing
risk-sensitive and risk neutral algorithms for both settings. Right
now I think that the description of the contributions hides the most
useful contribution.
Second, the paper was fairly notation- and information-dense. I
recognize this is hard to account for, but I think it is worth
revising the paper from the standpoint of removing as much notation
as possible and redefining notation frequently. I found myself
flipping between initial definitions and subsequent uses of notation
a fair amount. As an example of potentially redundant notation, the
paper uses $u$ for state frequencies and $\mu$ for expected feature
counts, and switches back and forth between $R$, the state-by-state
reward estimates, and $W$, the weight estimates for reward
functions. These are equivalent encodings, so the paper might be
simpler if the authors picked one.
For example, consider the start of section 4 (l.172-178). The second
sentence defines $\psi$ in a subclause (which is missing a "," at the
start) and then defines the central optimization problem. These are
both important concepts that should be defined on their own and
highlighted to the reader. I would break this into many more
sentences: "Let ... policies. Let ... functions. Define a performance
measure $\psi$ as.... This is useful because.... With this definition,
we can define our risk averse optimization: ...."
In l.177, the authors use $\rho$ and rely on the reader remembering
the definition from the start of the previous section. While this is
somewhat small, I think it adds up over the course of the paper.
Finally, I think the authors should move some of their examples
further up in the paper. The paper describes risk aversion well, but I
think that the authors could bring examples higher up in the paper to
give an intuitive sense of why this is valuable. For example,
including some parts of Fig. 4 earlier on to help motivate the reader
to work through the notation will make the paper much more readable.

__ Relation to Prior Work__: Yes, I think this is fairly clear. One area to improve is l.177, where
it isn't clear that the state-occupancy distribution formulation of
the LP comes from prior work. (In particular, instead of "We make use
of the one-to-one..." consider "We build on the classic LP formulation
of MDP planning, which optimizes the state occupancy distribution
subject to flow constraints" or something similar.)

__ Reproducibility__: Yes

__ Additional Feedback__: Overall, I think this paper is an OK presentation and evaluation of a _really_ neat and potentially influential idea. I think that the paper should be accepted and I am open to revising my score upwards if the authors can make substantial improvements on presentation/evaluation.
Update after discussion phase: my position is unchanged --- this is a good paper that should be accepted.

__ Summary and Contributions__: This paper describes a method for imitation learning in the paradigm
of combining an IRL method with a planning method. It deviates from
previous methods by using a risk-sensitive planning method that is
adjustable in its risk sensitivity and hence more versatile than fully
adversarial or fully risk-neutral approaches.

__ Strengths__: Adjustable risk-sensitivity seems vital for many applications of
imitation learning, and it is currently understudied.
I found the example on the bottom of page 1 fairly convincing support
for the claim that application-specific risk-sensitivity is valuable,
in that it satisfies some real need on the part of the developer.
The solution method is fairly clean and intuitive as a solution to a
linear program.
The experiments, while toy, do demonstrate the value of the approach
over either extreme and over previous approaches.
The paper is useful because of its careful analysis, strong motivation,
and informative discussion of related work.

__ Weaknesses__: The experiments are still a bit toy and small-scale. The paper would
be significantly more convincing with human experiments or experiments
in larger domains.
I was not convinced that conditional value-at-risk is the best
approach for adjustable risk-sensitivity, though it is fairly natural
and has support in the previous literature.
The key issue is how to handle misspecification of the Bayesian prior.

__ Correctness__: As far as I can tell the methods follow well-known patterns and seem
correct. The experimental design was convincing -- avoiding a focus
on a particular base reward function keeps the analysis fair and
clear.
Minor nitpicks:
l.137 "This is a strict generalization of the standard MDP formalism"
- this seems an unnecessary and possibly confusing remark. The
standard MDP formalism says nothing (as far as I know) about the form
of the reward function R(s,a,s'). It need not be a table.

__ Clarity__: Overall I found the paper very clear and well written, specifically:
- The example at the bottom of the first page was very helpful in
motivating the topic.
- Table 1 was helpful in understanding the relation to other work.
- The experimental analysis was clear. The choice not to analyze the
effects with regard to a particular "true" reward function was
effective in avoiding possible confusion.
- Sorting the reward functions in Figure 5 was a good idea.
Minor comments:
l.25 demonstrator’s values -> demonstrator’s preferences [or, objectives]
l.28 "to learn" -> "it to learn"
l.83 "Bayesian IRL infers a posterior distribution over which rewards are most likely, but only optimizes for the mean" - this might be true in these papers, but
obviously it is not the true Bayesian way or the true meaning of "Bayesian IRL"
l.107 GAIL -> and GAIL
l.110 between -> among
l.336 it’s -> its
References: many problems with over- and under-capitalization of titles, names, acronyms (Aaai, Icml), etc.
NIPS-> NeurIPS
Source for [25]?

__ Relation to Prior Work__: I think this was very well done overall. The care taken with
comparative analysis is impressive.
The distinction between this paper and [8] "Efficient probabilistic
performance bounds for inverse reinforcement learning." could be
emphasized a bit more. When it was first mentioned the difference
seemed minor because I did not realize the significant difference
between value at risk and conditional value at risk, nor the stark
difference in the ability to optimize for these metrics. Making this
more explicit on line 95 could avoid readers believing that this is a
small change on previous work.
There is a bit of recency bias:
- Older citations for imitation learning should be added at the
beginning; imitation learning didn't start in 2018.
- Risk-sensitive planning under uncertainty is much older than the
2010s. See, e.g.,
S. Koenig and R. G. Simmons. Risk-sensitive planning with
probabilistic decision graphs. In Proceedings of the Fourth
International Conference on Principles of Knowledge Representation
and Reasoning (KR-94), pages 2301–2308, 1994.
- Perhaps look at older work on robust RL (as distinct from solving
robust MDPs), noting that the standard RL update rules are
certainty-equivalent. See, e.g.,
Jun Morimoto and Kenji Doya. Robust Reinforcement Learning.
NIPS 2001 and Neural Computation 2005.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper studies imitation learning from a Bayesian perspective. Previous work mostly focuses either on risk-neutral IRL (via expected reward maximization) or on a minimax, adversarial perspective. This paper proposes a new learning objective: a weighted combination of expected return and conditional value at risk. Under a linear reward assumption, the objective reduces to a linear program and can be solved efficiently; however, it requires a prior or posterior distribution over the reward function as input. In summary, this work provides a bridge between the two previous extreme frameworks, and achieves better empirical performance on some simple domains.

__ Strengths__: I like the idea of balancing risk and return in the imitation learning objective using a hyperparameter $\lambda$. The algorithm is well stated, and under a linear reward assumption the method reduces to a simple LP and is computationally efficient. The empirical performance is examined in two simple domains, illustrating how $\lambda$ guides the learned policy from risk-neutral (but potentially high-variance) to risk-averse. In both domains, the proposed method also achieves better learning performance than baseline methods.
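For readers unfamiliar with why CVaR itself admits an LP, here is a hedged sketch of the standard Rockafellar-Uryasev auxiliary-variable construction (the return samples and $\alpha$ are invented; this is not the paper's full program, which additionally optimizes over policies):

```python
import numpy as np
from scipy.optimize import linprog

psi = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical returns under n sampled rewards
alpha = 0.5                            # tail probability for CVaR
n = len(psi)

# Variables x = [sigma, z_1, ..., z_n]; maximize sigma - (1/(alpha*n)) * sum(z)
# (linprog minimizes, so negate the objective).
c = np.concatenate([[-1.0], np.ones(n) / (alpha * n)])
# Constraints z_i >= sigma - psi_i, rewritten as sigma - z_i <= psi_i.
A_ub = np.hstack([np.ones((n, 1)), -np.eye(n)])
bounds = [(None, None)] + [(0, None)] * n  # sigma free, z_i >= 0

res = linprog(c, A_ub=A_ub, b_ub=psi, bounds=bounds)
cvar = -res.fun
```

For these samples, the optimum equals the mean of the worst half of the returns, i.e. 1.5, matching the empirical CVaR at alpha = 0.5.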

__ Weaknesses__: The method is easy to understand and implement; however, I feel its success relies heavily on the following assumptions, which the authors do not justify well:
- The method seems to rely heavily on the quality of the prior/posterior distribution over rewards. Is there any justification (theoretical or empirical) of how the method performs under misspecification of the reward distribution?
- The linear structure of the reward brings computational convenience, but it is hard for rewards to satisfy this structure in real-world applications. Is there any justification for it, or an empirical examination of how the method performs in more complex domains?
- The optimization objective seems to have $SA$ variables in total, which becomes huge in large state spaces. Is there a generalization of the method in which the policy can be parametrized?
There is also another concern from the computational perspective: it may not be easy to obtain a prior/posterior distribution in real-world applications, which may limit the use of the proposed method.
Empirically, the authors examine the performance of the method in two simple domains; it would be great to see more experiments in complex domains such as deep RL or health care.

__ Correctness__: The proposed algorithm is correct and the empirical methodology seems reasonable.

__ Clarity__: The paper is well written and easy to follow. One minor improvement would be to incorporate more information about how to obtain the reward distribution and the complexity of doing so, to make the paper more self-contained.

__ Relation to Prior Work__: The paper gives a clear comparison with previous imitation learning methods in Table 1. However, I am not that familiar with the Bayesian imitation learning literature, so I cannot judge how thorough it is.

__ Reproducibility__: Yes

__ Additional Feedback__: The comments and suggestions are shown in the "Weakness" part, and I am willing to change my score if they are well-justified.

__ Summary and Contributions__: This paper presents Bayesian Robust Optimization for Imitation Learning (BROIL). The main motivation for the method is to learn a policy that is robust with respect to the uncertainty that arises from inverse reinforcement learning. BROIL is a framework for optimizing a policy that minimizes baseline regret given demonstrations from experts. The authors compare their approach to existing risk-sensitive inverse reinforcement learning algorithms and claim improvements in expected return.

__ Strengths__: Overall the paper is sound, straightforward, and meets its well-stated objectives.
- The paper is easy to read, and the work has been well situated in the context of other related work.
- The BROIL formulation is natural to the extent that I am surprised it is indeed novel in the context of RL.
- There is good potential to extend this work to more complex domains.

__ Weaknesses__: It’s hard to assess the implications of this work. There are many possible claims of novelty:
- This is the first work to adapt the concept of conditional value at risk to the inverse reinforcement learning setting.
- The BROIL formulation, which seems to be just a linear combination of expected reward and CVaR.
- The fact that CVaR and BROIL can be formulated as a linear program.
Which of these are the claims of the paper?
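To make the "linear combination" point concrete, here is a toy numeric sketch (all numbers invented, not from the paper) of how a lambda-weighted mix of expected return and empirical CVaR trades off a risky policy against a safe one:

```python
import numpy as np

def cvar(psi, alpha):
    """Empirical CVaR: mean of the worst alpha-fraction of sampled returns."""
    k = max(1, int(np.ceil(alpha * len(psi))))
    return float(np.sort(psi)[:k].mean())

# Hypothetical returns of two candidate policies under 4 sampled reward functions.
risky = np.array([0.0, 0.0, 10.0, 10.0])  # high mean, bad worst case
safe = np.array([4.0, 4.0, 4.0, 4.0])     # lower mean, no downside

def score(psi, lam, alpha=0.25):
    # lam = 1 recovers risk-neutral expected return; lam = 0 is pure CVaR.
    return lam * psi.mean() + (1 - lam) * cvar(psi, alpha)
```

With lam = 1 the risky policy scores 5.0 versus 4.0 for the safe one; with lam = 0 the ordering flips (0.0 versus 4.0), so the weight alone moves the chosen policy between the two extremes.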
The second worry is that the scope of the method is unclear, and consequently, so is its utility in practice. The examples are small, which is not a problem in its own right, but we do not get an intuition about the limiting factors. Is it the in-the-loop MDP?
The method is parameterized on having samples from the posterior, which they generate using MCMC. Assuming the samples are not exact, how do approximations in the MCMC method propagate into the optimization of BROIL?
The method is presented in the context of inverse RL. Has it already been addressed within ordinary (non-inverse) RL? There are several examples of RL problems where taking risk into account is critical. In some sense, the inverse RL/imitation learning seems a bit superfluous to their method.

__ Correctness__: They appear correct.

__ Clarity__: The paper is clearly written and well structured.

__ Relation to Prior Work__: The paper makes clear its relation to existing work.

__ Reproducibility__: Yes

__ Additional Feedback__: