This paper introduces an RL method that satisfies safety constraints during both training and evaluation, using shielding over continuous state and action spaces so that unsafe actions are never selected. The main technical contribution is a symbolic safety specification and symbolic safe policy, which are lifted into a continuous space via imitation learning. Policy updates occur in the lifted space, and the policy is then projected back to the symbolic space, where it can be verified against the specification. The method has the added advantage that the set of symbolic safe policies and safety specifications can expand over time as more experience is collected from interactions with the environment.

I think this is an interesting scheme; few safe RL methods can guarantee safety during training while also expanding the safe set. A major drawback is that policy updates appear very time-consuming, which was the reviewers' main concern in the post-rebuttal discussion, but we all agree this is a separate issue that does not detract from the merits of the method. The reviewers recommend a borderline accept. I agree and recommend acceptance as a poster.
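To make the lift / update / project-and-verify loop concrete, the following is a minimal sketch in a toy 1-D setting. All names (`SymbolicPolicy`, `lift_via_imitation`, `project_to_symbolic`, `verify_against_spec`) and the linear "lifted" policy are my own illustrative placeholders, not the paper's interfaces, and the RL update is a stand-in perturbation rather than an actual gradient step.

```python
# Hypothetical sketch of: lift symbolic policy -> update in continuous space ->
# project back to symbolic space -> verify -> only adopt if still safe.
import numpy as np


class SymbolicPolicy:
    """Toy symbolic policy: one action per cell of a discretized 1-D state space."""

    def __init__(self, bins, actions):
        self.bins = bins          # interior edges of the state discretization
        self.actions = actions    # one action per symbolic cell

    def act(self, state):
        cell = int(np.digitize(state, self.bins))
        return self.actions[min(cell, len(self.actions) - 1)]


def lift_via_imitation(symbolic_policy, n_samples=1000):
    """Fit a continuous (here: linear) policy that imitates the symbolic one."""
    states = np.random.uniform(-1.0, 1.0, size=n_samples)
    targets = np.array([symbolic_policy.act(s) for s in states])
    X = np.stack([states, np.ones_like(states)], axis=1)
    w, b = np.linalg.lstsq(X, targets, rcond=None)[0]
    return lambda s: w * s + b


def project_to_symbolic(continuous_policy, bins):
    """Project the continuous policy back onto symbolic cells (via cell centers)."""
    centers = (np.concatenate(([-1.0], bins)) + np.concatenate((bins, [1.0]))) / 2
    return SymbolicPolicy(bins, np.array([continuous_policy(c) for c in centers]))


def verify_against_spec(symbolic_policy, spec=lambda a: abs(a) <= 1.0):
    """Check the safety specification on every symbolic cell's action."""
    return all(spec(a) for a in symbolic_policy.actions)


# One iteration of the loop.
bins = np.linspace(-1.0, 1.0, 9)[1:-1]                       # 8 symbolic cells
safe_symbolic = SymbolicPolicy(bins, np.clip(np.random.randn(8), -1, 1))

lifted = lift_via_imitation(safe_symbolic)
updated = lambda s: lifted(s) + 0.05 * s                      # stand-in for an RL update
candidate = project_to_symbolic(updated, bins)

# Adopt the update only if the projected policy still verifies against the spec;
# otherwise keep shielding with the previously verified symbolic policy.
current = candidate if verify_against_spec(candidate) else safe_symbolic
```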