Sun, Dec 8 through Sat, Dec 14, 2019, at the Vancouver Convention Center
This paper proposes a novel technique to reduce memory consumption when training deep sequence models via a fixed-point formulation. The technique requires constant memory regardless of the effective depth of the network. All the reviewers found the work sufficiently novel and interesting for publication at NeurIPS. The algorithms are theoretically justified, and the paper is well written and easy to follow. The authors' rebuttal addressed the reviewers' concern about the comparison with gradient checkpointing. However, the proposed method is considerably slower in both training and inference, which may limit its practical utility on very large datasets. Please include this discussion in the revised version.
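To make the constant-memory claim concrete for readers of this decision: a minimal NumPy sketch of the general idea (not the authors' code; the layer `z = tanh(W z + x)`, the loss, and all sizes are illustrative assumptions). The forward pass iterates a layer to a fixed point while storing only the current iterate, and the gradient is recovered at the fixed point via the implicit function theorem instead of backpropagating through every iteration, so memory does not grow with effective depth.

```python
import numpy as np

# Hypothetical example, not the paper's method: one layer applied
# repeatedly until convergence, z* = f(z*) with f(z) = tanh(W z + x).
rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=0.3 / np.sqrt(d), size=(d, d))  # small norm -> contraction
x = rng.normal(size=d)

def f(z):
    return np.tanh(W @ z + x)

# Forward: iterate to the fixed point; only the current iterate is stored,
# so memory is O(1) in the number of iterations (the "effective depth").
z = np.zeros(d)
for _ in range(500):
    z_next = f(z)
    if np.linalg.norm(z_next - z) < 1e-12:
        z = z_next
        break
    z = z_next

# Backward via the implicit function theorem: for a loss L(z*),
#   dL/dx = (df/dx)^T (I - df/dz)^{-T} dL/dz*,
# evaluated at z*, where df/dz = diag(1 - z*^2) @ W for tanh.
dLdz = z.copy()                         # e.g. L = 0.5 * ||z*||^2
Jz = (1 - z**2)[:, None] * W            # Jacobian of f w.r.t. z at z*
v = np.linalg.solve((np.eye(d) - Jz).T, dLdz)
dLdx = (1 - z**2) * v                   # df/dx = diag(1 - z*^2)
```

The key point the review alludes to is visible here: gradient checkpointing would still recompute and traverse the iteration history, whereas the fixed-point gradient needs only `z*`, at the cost of an extra linear solve (and, in practice, slower training and inference).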