There is a consensus among the knowledgeable reviewers that this work makes a significant contribution to the kernel community. It integrates several practical techniques and engineering efforts to further improve the scalability of kernel machines. The techniques proposed in this work will permit the use of multiple GPUs when training kernel-based models on huge amounts of data, which I also see as a significant contribution.

Regardless of the overall score, I think this paper deserves an oral presentation because it shows how to take full advantage of GPU hardware when solving learning problems with kernel methods. Scalability is one of the long-standing problems in kernel machines, yet it has been largely neglected and under-appreciated in the past few years. Unlike previous work, this paper presents a solution built on non-trivial engineering efforts, such as an out-of-core implementation, that allows both GPU acceleration and parallelization across multiple GPUs to be exploited. This development could open up new applications for kernel methods in areas that have previously been dominated by deep learning models. The paper is therefore accepted as an oral presentation.

Last but not least, R1 and R2 raised an important concern regarding the empirical comparison to SVGP. I hope the authors will take the reviewers' comments into account when revising the manuscript for the camera-ready version, in particular by better justifying their empirical comparison.