__ Summary and Contributions__: This paper proposed a theoretical RL framework to solve constrained MDP problems under tabular and episodic settings. The distinguishing characteristic is that the paper focused on sample-efficient exploration, which is rarely related with constrained MDPs before.

__ Strengths__: Overall this paper has a great motivation, and provides a thorough mathematics reasoning for the proposed model. This could be the start of a novel research area for solving constrained MDPs.

__ Weaknesses__: My major concerns:
1. line 248 suggested linear programming could be used in ConPlanner, but instead the experiment tested on different unconstrained RL planners under Lagrangian heuristic. I think the papers should have compared results of different constrained problem solver.
2. The paper gave a formulation for knapsack setting, but didn’t test it on any real-world problems. While theoretical proof was plenty, the paper didn’t provide any empirical support, making this method less intuitive.
3. Although the paper claimed they compared the proposed framework with other concave-convex approaches, the problems they experimented on didn’t seem to be concave-convex. Grid world problem such as Mars rover applied in the paper has linear constraints instead of convex ones. On the other hand, in line 270, the paper suggested most previous methods are limited to linear constraints except one. Understandably, It is hard to find previous approaches or benchmarks to compare with under concave-convex settings. The paper should just state them as they were.
4. The vanilla reward function is simply the expected sum of reward, aka, f is nothing but an identity function. Is there any scenario we need to use a complex f function? The paper doesn't seem to offer such examples in the experiment part.
5. In the related work section, the paper mentions that the closest work are (Singh et al., 2020; Efroni et al., 2020; Qiu et al., 2020; Ding et al., 2020). As pointed out, one key difference is that these concurrent work are restricted to linear obj and linear constraint case. A more detailed comparison is welcomed to clarify. For example, what will the result be like when this paper reduces to linear obj and linear constraint? Is there any possibility of extending the concurrent papers' analysis methodology to the convex-concave case?
Some minor problems:
line 126&178: it would be better if paper add simple statement at where modularity comes in play for both basic and concave-convex setting
line 202: in Theorem 4.1 it would be better if there is an explanation of what Ld is.
line 206: it would be clearer if paper briefly explains how sublinear aggregate reward regret is related to policy being optimal.
line 257: there should not be any reward when rover crashes into a rock. I believe it is a typo.
line 269: a typo”previous”

__ Correctness__: The claims, method, and empirical studies look reasonable to me.

__ Clarity__: The paper is well-written.

__ Relation to Prior Work__: As discussed above, the comparison with concurrent related work should be elaborated.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: Authors considered a constraint MDP problem in episodic setting. They proposed a model based method, that learns the transition, reward, constraint with MLE estimate, and form an optimization problem with it. For strategic exploration they used OFU and given the concentration inequalities they proved optimism. Further they extended their algorithm to convex-concave and knapsack setting. It's important to note that the method is for tabular setting and optimization problem is convex in occupancy measures, so efficiently solvable. Additionally they performed some experiments.

__ Strengths__: Theory : paper is very strong theoretically (although I have some question later), proofs are interesting and for convex-concave and knapsack setting non-trivial. However I found the proofs of basic algorithm fairly trivial.
Motivation : I strongly believe this is a very important problem, and of value to the community, and has been an under-explored part of the field.

__ Weaknesses__: Writing : I find the paper hard to follow. In order to understand the paper I had to go back and forth to the appendix. I believe authors can put the experiments in the appendix (where for me is of lower importance) and spend more time describing the proofs in the main text and make the intuition clear. (But I understand that field at this point requires a lot of experiments, so I wait to hear other reviewers perspective before making a suggestion here but for me this paper is more of a theoretical contribution)

__ Correctness__: I checked the proofs, and I believe they are correct to the best of my knowledge.

__ Clarity__: As I mentioned, I think writing can be improved, and having to go back and forth to appendix is not optimal. More intuition on how proofs are done is needed in the main text. For example "We apply mean-value theorem to show that this is the case" : This gives almost no intuition on how the proof has been written.

__ Relation to Prior Work__: I am not an expert in this field, but to the best of my knowledge authors have done a good job connecting the dots from bandits literature.

__ Reproducibility__: Yes

__ Additional Feedback__: I have some additional questions regarding theory,
1. PAC bound : I wonder why authors don't represent a PAC bound? Is there a theoretical road block, and if so what needs to be proven for that, and why is it hard?
2. Knapsack Solver: I couldn't find the solver for this, is it also going to be convex? if not, that's a main lacking point of the paper.
3. Scalability : As I said, I believe theoretical analysis by themselves are worth being put out for the community, but I wonder if authors thought about non-tabular setting? And how to scale up the method to larger state spaces, as most practical aspect of the work are not tabular.
## Read authors feedback, and I keep my score as it was.

__ Summary and Contributions__: The paper studies an online MDP problem with concave rewards and convex constraints in an episodic setting, where the model parameters on the rewards and resource consumption are not known. The authors proposed an optimism based algorithm, and quantify the performance guarantee in terms of the reward regret and the constraint regret. The main result is sqrt{K} expected regret bounds for those two regrets. The authors also studied a hard constraint version of the problem, where the resource constraints must be satisfied always.

__ Strengths__: The results incorporate general concave rewards and convex constraints, which provide a general model to capture a variety of applications. Personally, I feel that the notion of aggregate rewards for RL has been long neglected, in view of its numerous applications in RL exploration. This work fills in the gap in the case of episodic MDP, which makes myself to be slightly in favor of acceptance than reject. The regret bounds essentially match those for the scalar reward setting, and the authors also point out the non-trivial steps in generalizing from (Agrawal and Devanur 2011) to the current episodic MDP setting.

__ Weaknesses__: 1. While in the "Strengths" I remarked that regret bounds essentially match those for the scalar reward setting, I feel that there is still some gap in the bound. In Theorem 4.1, the consumption regret upper bound of "Ld · CONSREG" has a rather loose dependence on d, as in it does not match the dependence of \| 1_d \| in (Agrawal and Devanur 2011). Here, I denote 1_d as the d-dimensional all 1 vector, and g is L Lipschitz continuous with respect to the norm \|\cdot\|.
2. In the main results, the authors show that the constraint regrets are bounded in expectation. Do these bound translate to a high probability bound, in the same way as (Badanidiyuru 2013, Agrawal and Devanur 2014, Cheung 2019)?
3. Another place that needs substantial improvement is the numerical experiments. While the authors have provided additional details for the numerical experiments in Appendix E, there are quite a few places that require clarification:
- How does the plots in Figure 1 corroborate with the regret bounds? More precisely, it is not clear how a trajectory means in the online model. For example, if I look at the top left plot for RCPO, it reports that at No. of trajectories = 500, the reward is \approx 1.65. What does it mean? Does it mean that if I run the RCPO with 500 episodes than the empirical average reward realizes as 1.65, or does it mean something else?
- Can the authors provide plots about the cumulative regret of the algorithms (at least for the proposed algorithms, in case they don't make sense for existing algorithms like RCPO for some reason)? This provide a more direct way to empirically evaluate the proposed algorithms.
- In relation to the previous point, in footnote 7, it remarks that the bottom row corresponds to "the aggregate actual constraint incurred during training". However, in an online model, how is the notion of "during training" defined?
- It is not clear why the authors include A2C, which requires the access to the latent model on p as shown in the Appendix (so it seems to violate the model assumption of not knowing p for example?)
- I am in fact quite confused by column C. Is there any reason why TFW-UCRL2 is run with significantly more trajectories than the other two algorithms ConRL-A2C and ConRL-Value Iteration?
- When I tried to dig deeper in Appendix E, it is stated that “TFW-UCRL2 gives fixed weights to reward and constraint violation and maximizes a scalar function. Therefore we tuned TFW-UCRL2 for different weights and the reported result is for the best weight.” Nevertheless, to my knowledge, the TFW-UCRL2 in fact assign dynamic weights () to the penalty function for the constraints, and those weights do not need tuning in the sense that the dynamic update of the weights Can the authors provide a high level sketch on how they implemented the TFW-UCRL2 in the online episodic setting?

__ Correctness__: The theoretical results are correct to my knowledge. I am not entirely sure about empirical methodology, and I look forward to the reply by the authors on the questions in "Weakness".

__ Clarity__: Apart from the numerical results, other parts are well written. I strongly advice the authors to provide more details (at least enough details to address the questions in the "Weakness") in the appendix?
Some minor comments for polishing the paper:
Line 192: The convexity in occupation measure might not be immediately clear to readers unfamiliar with (Agrawal and Devanur 2011). The authors could make reference to the relevant Lemma in that reference.
Line 269: prevnous -> previous

__ Relation to Prior Work__: To my knowledge, the immediately relevant literature, in particular the concurrent results (Singh et al., 2020; Efroni et al., 2020; Qiu et al., 2020; Ding et al., 2020), are clearly discussed. The authors could also include the following reference that has a similar setting to (Cheung 2019):
Jean Tarbouriech, Alessandro Lazaric. Active Exploration in Markov Decision Processes. AISTATS 2019.

__ Reproducibility__: Yes

__ Additional Feedback__: In the concave-convex setting in Section 4:
The Lagrangian heuristics LAGRCONPLANNER considered in the numerical experiments appear much more tractable than the basic ConRL algorithm, I feel that the authors could consider bounding the regret of those heuristics directly. In fact, LAGRCONPLANNER bears a lot of similarity to the dual based approach by Agrawal and Devanur 2011.
########### Post Rebuttal ##############
The authors have addressed the major concerns on the numerical experiments, and the overall score is increased.I look forward the authors to provide those details in the appendix for the numerical experiments.

__ Summary and Contributions__: The paper proposes an algorithm to learn policies in environments with concave rewards and convex constraints, in the tabular and episodic setting. The authors provide theoretical guarantees on the performance w.r.t reward and consumption regrets.
They also demonstrate the algorithm on 2 environments - box and mars rover with comparison against constrained policy optimization baselines.
I've read the rebuttal and going to stick to my rating.

__ Strengths__: The problem statement is clearly defined.
The assumptions are clearly laid out.
The objectives of bounding reward and consumption regrets seem fair.
I haven't verified the theoretical analysis of the claims, so can't comment on that.

__ Weaknesses__: The experimental setup could improve:
1. The environments chosen don't adequately demonstrate the resource constrained setup that they wish to deploy the algorithm in.
Specifically, the knapsack setting is interesting but a challenging environment is lacking. A real world example is that of budgets earmarked for campaigns or energy resources in games.
2. The tabular setting is helpful for theoretical analysis but it would be helpful to mention how the algorithm and analysis would translate to the function approximation case.

__ Correctness__: Not experienced in this area and hence couldn't verify the claims.

__ Clarity__: The paper is generally well written.
Nit:
Line 104: Q-function is known as the action-value function in the standard literature.

__ Relation to Prior Work__: The related work on bandits is mentioned. The algorithmic improvements appear less novel while the analysis is thorough.

__ Reproducibility__: Yes

__ Additional Feedback__: