Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
The paper is generally well written and clear. The contributions are substantial and represent an important step in applying pre-trained models like BERT to continual learning setups. The submission appears technically sound, with good analysis of the various components and their complexity.
This paper proposes the neat idea of a key-value store, with pre-trained sequence representations as keys, to train a model in a streaming fashion. The paper is very clear, the experiments are convincing, and this work will push the direction forward.
The paper addresses the very important topic of lifelong learning, and it proposes to employ an episodic memory to avoid catastrophic forgetting. The memory is based on a key-value representation that exploits an encoder-decoder architecture built on BERT. Training is performed on the concatenation of different datasets, without the need to specify dataset identifiers. The work is highly significant and the novelty of the contribution is remarkable. One point that would have deserved more attention is the strategy for reading from and writing to the episodic memory (see also comments below). Originality: high. Quality: high. Clarity: high. Significance: high.
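To make the reviewed mechanism concrete, the key-value episodic memory described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name, the plain nearest-neighbour read, and the use of raw vectors in place of actual BERT encodings are all assumptions made for brevity.

```python
import math

class EpisodicMemory:
    """Hypothetical sketch of a key-value episodic memory.

    Keys would be fixed representations of inputs from a pre-trained
    encoder such as BERT (simulated here as plain lists of floats);
    values are the stored training examples. Reading retrieves the
    k nearest stored examples to a query key.
    """

    def __init__(self):
        self.keys = []    # encoder representations of stored inputs
        self.values = []  # associated examples, e.g. (input, label)

    def write(self, key, value):
        # Examples are written to memory as they stream in,
        # with no dataset identifier required.
        self.keys.append(list(key))
        self.values.append(value)

    def read(self, query, k=2):
        # k-nearest-neighbour lookup by Euclidean distance over keys.
        def dist(key):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(key, query)))
        ranked = sorted(range(len(self.keys)), key=lambda i: dist(self.keys[i]))
        return [self.values[i] for i in ranked[:k]]

mem = EpisodicMemory()
mem.write([0.0, 0.0, 0.0, 1.0], "example-A")
mem.write([1.0, 0.0, 0.0, 0.0], "example-B")
mem.write([0.9, 0.1, 0.0, 0.0], "example-C")
print(mem.read([1.0, 0.0, 0.0, 0.0], k=2))  # → ['example-B', 'example-C']
```

The read strategy sketched here (plain nearest neighbours) is exactly the kind of design choice the review suggests deserved more discussion, e.g. how many neighbours to retrieve and whether writes should be sampled.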