The authors propose early stopping at test time to improve both inference speed and accuracy. The idea is to train a classifier at each layer of a multi-layer model such as BERT and to classify one layer at a time, stopping once the prediction stops changing (see the sketch below). The technique is simple but effective and yields good empirical results; most excitingly, the authors report both speed-ups and higher accuracy.

There are concerns about the presented theorem: its main assumption, that misclassification events at different layers are independent of one another, should be stated more explicitly, and it is also unrealistic, since misclassifications at different layers are almost certainly correlated. That said, the theorem is not a central part of the paper, which is primarily empirical. It could also be worth rethinking some of the presentation, as several reviewers were not convinced by the discussion of "overthinking".

Finally, while it was neither possible nor required to discuss concurrent ACL work in the submission, the AC suggests discussing it in the final version, since those papers have now been out for a while.
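For concreteness, a minimal sketch of the layer-wise early-exit inference loop as described above. The function name, the `layers`/`classifiers` lists, and the `patience` parameter are illustrative assumptions, not the authors' implementation; with `patience=1` the loop stops as soon as two consecutive layers agree.

```python
import torch

def early_exit_predict(layers, classifiers, x, patience=1):
    """Illustrative early-exit forward pass (not the authors' code).

    `layers` and `classifiers` are assumed to be parallel lists of
    per-layer transformer blocks and per-layer classification heads.
    Inference stops once the predicted label has stayed unchanged for
    `patience` consecutive layers.
    """
    hidden = x
    prev_label, streak = None, 0
    for layer, clf in zip(layers, classifiers):
        hidden = layer(hidden)              # run one more layer
        label = clf(hidden).argmax(dim=-1)  # prediction from this layer's head
        if prev_label is not None and torch.equal(label, prev_label):
            streak += 1
            if streak >= patience:          # prediction has stopped changing
                return label                # early exit
        else:
            streak = 0
        prev_label = label
    return prev_label                       # fell through to the final layer
```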