__ Summary and Contributions__: The paper proposes a new approach for enforcing properties on the predictions an RNN makes over sequences of data. The key idea builds upon knowledge distillation: a student network learns to predict a sequence that maximizes the posterior probability, while a teacher network generates a formally correct trace (w.r.t. the properties to be enforced) that is closest to the predicted trace, reining in the student network when its predicted trace violates the given properties.

__ Strengths__: Fundamentally, it is a relatively straightforward idea, but it is nicely executed and very well presented. The evaluation points to the effectiveness of the proposed approach.

__ Weaknesses__: I have several questions for the authors.
(1) In Figure 1, it seems like the teacher network also employs a sequence network, but if I understood the idea correctly, all the teacher network does is find a correct trace, within the constraints in DNF (converted from the STL), that is closest to the trace predicted by the student network. Could you please clarify?
(2) I am a little confused about the distance metric in Equation 3, specifically the absolute-value term: does this mean that if a variable satisfies a constraint in a clause, the term evaluates to 0, and otherwise to 1?
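To make my reading of the metric concrete, here is a minimal sketch of the interpretation I am asking about (names and structure are hypothetical, not the paper's code): each predicate in a DNF clause contributes 0 to the distance if the corresponding variable satisfies it, and 1 otherwise.

```python
def clause_distance(trace, clause):
    """Hypothetical reading of Eq. 3: each (index, predicate) pair in a
    DNF clause contributes 0 if the indexed value satisfies the
    predicate, and 1 otherwise."""
    return sum(0 if pred(trace[i]) else 1 for i, pred in clause)

# Example: a trace of two values, a clause requiring x_0 < 5 and x_1 > 2.
trace = [3.0, 1.0]
clause = [(0, lambda x: x < 5), (1, lambda x: x > 2)]
print(clause_distance(trace, clause))  # 1: first predicate holds, second fails
```

If this reading is correct, the metric counts violated predicates rather than measuring how far a value is from the constraint boundary, which is the point I would like the authors to confirm.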
(3) I am a little skeptical about the scalability of Algorithm 1. By that I mean: what happens when there is a large number of predicates in the STL formula, as well as many variables to be predicted in the sequence? Since the authors did not touch upon this aspect in the evaluation, it would be great if they could discuss this matter.
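To concretize the concern: converting a formula to DNF distributes conjunction over disjunction, so the number of clauses can grow exponentially in the number of disjunctive conjuncts. A toy count (purely illustrative, not based on the paper's implementation):

```python
from itertools import product

# Distributing AND over OR: a formula (p1|q1) & (p2|q2) & ... & (pn|qn)
# expands to 2^n DNF clauses, each choosing one literal per conjunct.
def dnf_clauses(conjuncts):
    return [tuple(choice) for choice in product(*conjuncts)]

conjuncts = [(f"p{i}", f"q{i}") for i in range(3)]
clauses = dnf_clauses(conjuncts)
print(len(clauses))  # 8 = 2^3
```

If the teacher must search over all such clauses per time step, the worst case seems exponential in the number of predicates, which is exactly why I would like a discussion of scalability.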
(4) Finally, a trivial question: why not work with transformer-based sequence models rather than RNNs/LSTMs, given the dominance of the former?

__ Correctness__: There is nothing I can find in the paper that poses a threat to the correctness of the approach.

__ Clarity__: The paper is very well-written and easy to follow.

__ Relation to Prior Work__: I think the authors have done a good job articulating their contributions w.r.t. the prior art.

__ Reproducibility__: Yes

__ Additional Feedback__: Did not check the supplemental material as I think the paper is over the bar.
----------Updates After Author Response----------
This paper is somewhat heavily contested. To some degree, I do appreciate the other reviewers' concern about the necessity of using STL instead of FOL. But I still believe the paper has merit and is worth publishing. For the final version, I would strongly recommend the authors add a more detailed explanation of their choice of STL over FOL.

__ Summary and Contributions__: The authors propose a new learning architecture called STLnet. They use Signal Temporal Logic (STL) to ensure RNN predictions satisfy certain properties of Cyber-Physical Systems. Important types of model properties are identified and formalised using STL. A teacher network generates the trace closest to the student network's predicted trace that satisfies the model properties. The experiments show STLnet using LSTMs is much better at satisfying the model properties than regular LSTMs.

__ Strengths__: The proposed architecture is innovative, and the way in which the authors combine temporal logic with deep learning is interesting. I agree with the general claim that it is challenging to ensure the output of ML models satisfies certain properties, and the way in which the authors propose to do this is quite convincing to me. Temporal logic seems like an appropriate way of specifying this, and the empirical results look promising.

__ Weaknesses__: The paper contains a number of typos and mistakes that should be corrected, and the positioning in related work could be improved.
=== After rebuttal ===
I am quite satisfied with the authors' replies; however, during the reviewer discussions I did start to doubt the novelty of the approach. The system properties do not seem novel (they already appear in the literature), and the empirical results only compare against a regular LSTM. How does this relate to other comparable approaches?

__ Correctness__: Yes.

__ Clarity__: Yes, but there are a number of typos.

__ Relation to Prior Work__: I am not familiar enough with the topic to judge this in detail, but I do think the authors' claim "To the best of our knowledge, our framework is the first work that integrates temporal logic with deep neural networks." is a bit strong. Searching for "temporal logic deep learning" on Google Scholar seems to contradict this claim, so the authors could position the work better with respect to related work.

__ Reproducibility__: Yes

__ Additional Feedback__: General comments:
- 50: I didn't really get the challenge here. The optimization mechanism of autoregressive RNNs does take into account the predicted value (or target value, if teacher forcing is applied) at the previous time step. So I don't understand what you mean by "without comparing the temporal correlation of the two sequences as a whole".
- 87: Here you say f(x) >= 0 but in the appendix f(x) > 0.
- 91: Where do these properties come from? Did you invent them yourself? How did you get to them?
- Table 1: reasonable range: I think you are missing a closing parenthesis for "(x_1 < \alpha_1)". Also the new-line is at a strange place. Also, do 0-24 represent hours in a day? This is not explained.
- Table 1: resource constraint. Why are you using [0, 24] here, but (0, 24) in "reasonable range"?
- Algorithm 1: the loop on line 38 seems wrong. Epsilon is initialized with the empty set on line 37, but never seems to become something different than the empty set. For every t, we iterate over elements of epsilon on line 41. However since epsilon is empty this is skipped. Then, in line 46 we assign the value of epsilon2 to epsilon, which is empty again, and we continue. So this case always returns an empty set.
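To illustrate the issue, here is my reconstruction of the control flow as written (names hypothetical, referring to the line numbers of Algorithm 1 cited above); it shows why the result is always empty:

```python
# Reconstruction of Algorithm 1's loop as written, not as presumably intended.
def generate_traces_as_written(T, extend):
    eps = set()                 # line 37: epsilon initialized empty
    for t in range(T):          # line 38: loop over time steps
        eps2 = set()
        for trace in eps:       # line 41: never entered, eps is empty
            eps2 |= extend(trace, t)
        eps = eps2              # line 46: epsilon is assigned an empty set again
    return eps                  # always the empty set

print(generate_traces_as_written(5, lambda tr, t: {tr + (t,)}))  # set()
```

Presumably epsilon should be seeded with some initial partial trace before the loop; as written, the extension step is unreachable.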
Typos:
- line 28: "a maximum and minimum limit" or "maximum and minimum limits"
- 69: "...a more flexible way..."
- 76: "..physical world, or rules followed..."
- 105: ".. and are not necessarily the same..."
- 132: subscripts for omega should be superscripts.
- 154: subscripts for omega should be superscripts.
- 178: subsection
- 194: "that are connected"
- 256: dropping

__ Summary and Contributions__: The paper introduces STLnet, a student-teacher approach where i) a teacher network generates traces satisfying the Signal Temporal Logic (STL) specification that are closest to a student network's predictions, and ii) these traces are then used to guide the student network to better satisfy the specifications. The approach is demonstrated on two tasks, with promising evidence regarding the utility of incorporating specification-based side information into the training procedure.

__ Strengths__: *The paper is technically correct, and the propositions are valid. The paper has interesting components, and considers a diverse set of specifications.
*The paper is clear, well-written and combines two diverse domains (specifications inspired by the CPS community + deep learning).

__ Weaknesses__: *Temporal logic as such is often useful when considering infinite traces. Signal temporal logic is quite useful in the case of finite-time traces when dealing with continuous-time systems. Neither of these settings is under consideration here, and I think this is the biggest drawback.
*Novelty is quite limited. Many of the ideas in the paper have been introduced before. I will list some here (which have not been discussed in the paper):
1. Writing STL specifications in terms of DNF specifications/using logical operators has been done previously in works such as:
a) https://arxiv.org/abs/1703.09563 (they use DNFs/MILPs, which are equivalent), b) https://openreview.net/forum?id=BklC2RNKDS
2. Projecting outputs to satisfy constraints has been considered in works such as:
a) https://arxiv.org/pdf/1805.07075.pdf, b) https://arxiv.org/abs/1801.08757, c) https://arxiv.org/abs/1603.06318. In fact, if you are only looking at finite, discrete-time traces, this work is quite similar to (c). Without infinite traces under consideration, the STL specifications reduce to the first-order logic specifications already considered in (c).
3. There is a rich history of considering temporal logic specifications, and even STL specifications, in the CPS community. As such, the specifications introduced here are not novel.
4. Propositions 4.1, 4.2 and 4.3 are trivial/obvious to the best of my knowledge.
5. There is very little discussion of related work on enforcing STL/TL specifications for learning systems -- there have been several papers attempting to do this:
a) https://robotics.sciencemag.org/content/4/37/eaay6276.full
b) https://dorsa.fyi/publications/sadigh2014learning.pdf
c) https://dl.acm.org/doi/abs/10.5555/3306127.3331994
d) https://openreview.net/forum?id=BklC2RNKDS

__ Correctness__: The propositions are correct, and the experiments are thorough.

__ Clarity__: Yes, the paper is quite clear.
Typo: Line 171, what is Step 4 here?

__ Relation to Prior Work__: I think a whole spectrum of related work has not been discussed in this work. I have included several pointers in an earlier section. I list some pieces of related work that are very closely related but haven't been included:
a) https://arxiv.org/abs/1603.06318 (Projects to a prediction satisfying a constraint, which is then used to teach a student network -- has a very similar teacher/student framework).
b) https://openreview.net/forum?id=BklC2RNKDS (enforced STL specifications for RNNs/LSTMs)
c) https://www.sciencedirect.com/science/article/pii/S240589631831139X (enforces STL specifications for DNN controllers)

__ Reproducibility__: Yes

__ Additional Feedback__: I have read the rebuttal and would like to stay with my current score as most of my concerns have not been addressed.

__ Summary and Contributions__: This paper is on the problem of integrating temporal logic knowledge/constraints with the training of recurrent networks (RNNs) to encourage the RNNs to produce logically meaningful predictions. The framework first formulates the temporal knowledge using signal temporal logic (STL). Then, during the RNN training, a teacher model is constructed which produces a sequence prediction that is closest to the target (student) model's prediction while satisfying the given STL. The teacher's prediction is then used as supervision for updating the target model. Experiments on synthetic data and a real air quality dataset show the proposed approach achieves stronger and more meaningful results than the vanilla RNNs.

__ Strengths__: - The problem of integrating structured/logical knowledge with deep NNs is of great practical significance yet under-explored
- The proposed approach, which uses STL for temporal knowledge encoding and constructs teacher models for supervising the target student model, looks sound
- Experiments show improvement over the vanilla LSTM on the real air quality dataset

__ Weaknesses__: - The framework has essentially the same structure as the one developed in [7], which also extends knowledge distillation [6] by first constructing a teacher model that integrates the target student's prediction with the logical knowledge, then distilling the information from the teacher to the student. The differences are:
* This paper considers temporal logic knowledge encoded with STL, instead of the first-order logic considered in [7]
* The construction of the teacher model is to an extent different: this paper manipulates the student's prediction according to the given STL, while [7] uses posterior regularization.
The technical contribution is thus less impressive. A discussion of [7] is missing from the paper.
- The experiments are a bit weak, since only one real dataset is used, and the comparison is only with the vanilla LSTM model. How about stronger baselines, such as an attentional LSTM, which is very popular with lots of open-source implementations? It would be interesting to see whether the proposed approach could be applied to, and improve over, slightly more advanced NN architectures.
- The STL trace generation in Section 4.2 involves quite a few steps. What is the computational complexity of the whole process?

__ Correctness__: Correct

__ Clarity__: Yes

__ Relation to Prior Work__: More discussion of [7] is necessary, given the similarity of the algorithmic frameworks

__ Reproducibility__: Yes

__ Additional Feedback__: