NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 4626
Learning Deterministic Weighted Automata with Queries and Counterexamples

### Reviewer 1

The authors introduce an algorithm for inferring "probabilistic deterministic finite automata" from a probabilistic language-model oracle that can compute P(next_symbol | string_so_far), which is exactly what an autoregressive neural language model, such as an RNN, would give you. The paper is well written, the problem is well motivated, and the algorithm appears to outperform the prior work, as measured by the experiments. However, I am not qualified to adequately review this paper, because I am not familiar with any of the related work. For this reason, I lean toward accepting the paper, but I have very little confidence in my assessment.

My only problems with this paper concern the experiments. The authors evaluate on SPiCe data sets, which appear to be standard benchmarks for finite-state language-model learning. However, the cited SPiCe paper describes 16 data sets, and the authors evaluate on only 4 of them. Why only these four? Are the other twelve too hard? It would be acceptable if your algorithm cannot handle the other twelve, provided that the prior work also does poorly on these test cases.

Relatedly, what would happen if you applied your algorithm to a language model trained on a large natural-language corpus? I imagine this is an ultimate intended use case of distilling finite-state machines from neural language models: you could then, for example, run the (finite-state) language model on low-power or embedded devices where floating-point operations are at a premium. In a similar vein, state-of-the-art language models for natural language are based on transformers, and as far as I can tell your algorithm would work equally well with a transformer, or indeed any autoregressive model. This might be an interesting experiment to run, or at least to mention as a possibility.
"Probabilistic deterministic" finite state machine sounds like an oxymoron, but what is actually meant is probabilistic symbol emissions combined with deterministic state transitions. For readers like me who are not familiar with this literature, a quick sentence in the introduction clarifying this would be helpful (the third paragraph and lines 112-115 make the meaning clear, but it would help if it came a bit earlier). The equation on line 125 might have a typo: $\sigma$ appears as a free variable on the left-hand side but does not occur on the right-hand side. Did you mean $\ldots = \frac{P^p(w\sigma)}{P^p(w)}$? That would make sense for calculating the probability of $\sigma$ following $w$.
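To make the suggested correction concrete, here is a minimal sketch of the last-token probability as a ratio of prefix probabilities, with a toy two-symbol model standing in for the RNN (all names here are hypothetical, not from the paper):

```python
# Toy stand-in for an autoregressive model's softmax output over {"a", "b"}.
# "a" grows less likely as the prefix lengthens (purely illustrative).
def next_token_probs(prefix):
    p_a = 1.0 / (len(prefix) + 2)
    return {"a": p_a, "b": 1.0 - p_a}

def prefix_prob(w):
    """Chain rule: P(w) = prod_k P(w_k | w_{:k})."""
    p = 1.0
    for k, sym in enumerate(w):
        p *= next_token_probs(w[:k])[sym]
    return p

def last_token_prob(w, sigma):
    """The corrected equation: P(sigma | w) = P(w.sigma) / P(w)."""
    return prefix_prob(w + sigma) / prefix_prob(w)

# Consistency check: the ratio recovers the model's own conditional.
w = "ab"
assert abs(last_token_prob(w, "a") - next_token_probs(w)["a"]) < 1e-12
```

The final assertion is the point of the correction: with $\sigma$ on the right-hand side, the ratio of prefix probabilities is exactly the model's conditional next-symbol distribution.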

### Reviewer 2

I think the paper is in a very interesting area but feels unfinished. The paper is reasonably well structured, though some parts felt repetitive (e.g. lines 45-75 in the introduction). Some definitions were difficult to find (typos and small comments are at the end of this section), but overall the descriptions of the background and model were easy to follow.

Ultimately, the experiments appear inconclusive. The paper only shows results for SPiCe challenges 0-3, where challenges 1-3 are synthetic examples (the simplest) and challenge 0 was not part of the competition scoring. So it is unclear how the model will scale. The Tomita experiments were also small and inconclusive. It is also unclear how the RNN hidden size is chosen and how it affects the WL and WFA algorithms. It would also be helpful to understand how the model's hyperparameters affect results. The variation tolerance $t$ matters a lot: it would be interesting to understand why $t = 0.1$ was chosen, and what the effect of varying $t$ would be across all experiments. The table expansion time and the random seed probably also affect performance.

I am curious about some design choices. For example, what is the significance of using total variation rather than other distance measures (e.g. KL divergence or cross-entropy)? Also, in the experiments, it is not clear whether Step 3 (answering the equivalence query) is ever used. What should be done if the hypothesis is not accepted?

Typos and small comments:
* The notation $w^k$ in line 108 should be (I think) $w_{:k}$.
* The colon in line 28 should be a comma.
* I didn't see an explanation of what the contents of the observation table $O$ are -- with the exception of lines 182-183, which state that the observations are tokens and not distributions. But the following line 184, along with the rest of the paper, discusses the comparison between *next-token distributions*. (Are you using RNN prediction probabilities for the comparison, or the empirical probabilities from the samples?)
* The notation $q^i$ does not appear to be defined. I assume it refers to the initial state; maybe use the notation $q_0$ instead, since $i$ is often used as an index.
* In line 125, the definition of the last-token probability, the equation should be $\frac{P^p(w \cdot s)}{P^p(w)}$.
* In line 193, "prefix weight" does not seem to be defined.
* The Tomita experiment results were difficult to find. Maybe a subsection heading or a simple figure would help.
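To illustrate the design-choice question above, here is a minimal sketch of a total-variation comparison under a tolerance $t$, as the paper's variation tolerance is described (a toy illustration, not the paper's implementation):

```python
def total_variation(p, q):
    # Total variation distance between two discrete distributions
    # given as probability vectors over the same support.
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def t_equal(p, q, t=0.1):
    # Two next-token distributions are treated as "t-equal"
    # if their total variation distance is at most t.
    return total_variation(p, q) <= t

p = [0.50, 0.30, 0.20]
q = [0.45, 0.35, 0.20]
assert t_equal(p, q, t=0.1)        # TV distance here is 0.05
assert not t_equal(p, q, t=0.01)
```

Unlike KL divergence, total variation is symmetric, bounded in [0, 1], and finite even when one distribution assigns zero probability to a token the other does not, which may be part of why it was chosen; the review's question about alternatives still stands.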

### Reviewer 3

This paper presents a technique for constructing a probabilistic deterministic finite automaton (PDFA) that can model a black-box language model such as an LM-RNN. The main idea of the algorithm is to adapt Angluin's L* algorithm to handle the probabilistic choices and unbounded states of an LM-RNN by developing a notion of variation tolerance. The variation tolerance allows two probability vectors to be compared and clustered together if they are within a threshold $t$. The goal is to construct a PDFA such that, for all prefixes, the next-token distributions of the PDFA and the LM-RNN are within the variation bound. The paper presents analogous variation-tolerance-aware extensions of the membership and equivalence queries in L*. The algorithm first learns an observation table that is closed and consistent, then constructs the corresponding PDFA using a clustering strategy, and finally performs an equivalence query using a sampling-based method. The technique is evaluated on grammars from the SPiCe competition and on adaptations of the Tomita grammars, and it outperforms n-gram and WFA (spectral learning) baselines in terms of both WER and NDCG.

This paper presents a novel, general technique to learn a weighted deterministic finite automaton (WDFA) from a given weighted black-box target, and it learns a PDFA for a stochastic target. Unlike previous approaches that use spectral learning or learn from sampled sequences, the presented technique learns a PDFA in an active-learning setting with an oracle by extending the widely used L* algorithm, which has some nice guarantees. The idea of using t-consistency for comparing probability vectors is quite elegant, and the idea of using clustering techniques to overcome the non-transitivity of t-equality is also quite interesting. There is also a detailed formal treatment of the properties and guarantees of the returned PDFA in terms of t-consistency.
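The non-transitivity that motivates the clustering step can be seen with three small probability vectors (a toy illustration of t-equality under total variation distance, not code from the paper):

```python
def total_variation(p, q):
    # Total variation distance between two discrete distributions.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

t = 0.1
p = [0.40, 0.60]
q = [0.50, 0.50]
r = [0.60, 0.40]

# p is t-equal to q, and q is t-equal to r,
# but p is NOT t-equal to r: t-equality is not transitive.
assert total_variation(p, q) <= t
assert total_variation(q, r) <= t
assert total_variation(p, r) > t
```

Because of chains like this, pairwise t-equality alone cannot partition the observed rows into states, which is why a clustering strategy is needed to assign distributions to PDFA states.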
The paper is also nicely written and explains the key ideas of the algorithm in the required detail. One suggestion would be to add a small running example to help clarify the extended L* algorithm a bit more (especially step 2 of the PDFA construction). It would also be nice to move at least the two theorems from the appendix to the main text.

I was curious about the scalability of the algorithm. Based on the SPiCe and Tomita benchmarks, it seems to scale better than spectral-learning-based WFA methods. But the LSTM network has an input dimension of 10 and a hidden dimension of 20, which seems quite small compared to current state-of-the-art LSTM language models. What is the maximum hidden size the technique can currently handle? Similarly, the alphabet size of up to 20 also seems a bit small compared to the large vocabularies typical of state-of-the-art language models. It would be helpful to report the maximum sizes the technique can currently scale to; that might help spur interesting future work to scale up the technique further. It wasn't clear whether, for the results reported in Table 1, the L* algorithm terminated before early stopping. If it did not terminate, what bounds on expansion time and suffix limits would be needed for L* to run to completion on these benchmarks? Since the equivalence check is performed by sampling, the only difference from previous PDFA learning methods that learn from samples would be that here the samples are collected actively. Would it be possible to also compare the presented L*-based technique to some of those previous sample-based PDFA reconstruction techniques?

Minor issues:
* line 108: w^k \in S -> w_{:k} \in S
* line 165: algorithm reduce the number -> algorithm reduces the number
* line 193: prefix weight: -> prefix weight.
* line 336: The the -> The