Summary and Contributions: This work proposes a methodology to train massive neural networks over distributed, heterogeneous, and unreliable hardware. To do this, it proposes a Decentralized Mixture-of-Experts (DMoE) layer along with a training framework that extends DMoE models over such training infrastructures. However, though this sparse architecture is generally amenable to the demands of the application, the authors note that MoE models scale poorly over flat layers of experts. They therefore propose a new gating function via product-key layers and a distributed hash table. I think this is a very promising direction and I strongly encourage the authors to continue this research and build the training infrastructure. However, in its current form, the paper reads more as a proposal document and is light on empirical results.
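For readers unfamiliar with product-key lookup, the selection mechanism can be sketched roughly as follows. This is my own illustrative Python, not the authors' implementation; the sizes and names are made up. The key idea is that a query is split into two halves, each half is scored against a small table of sub-keys, and the Cartesian product of the per-half top-k yields candidate experts, so N^2 experts can be searched with only O(N) comparisons:

```python
# Hypothetical sketch of product-key expert selection (not the authors' code).
import numpy as np

def product_key_topk(query, subkeys_a, subkeys_b, k):
    d = query.shape[0] // 2
    qa, qb = query[:d], query[d:]
    scores_a = subkeys_a @ qa            # (N,) scores for the first half
    scores_b = subkeys_b @ qb            # (N,) scores for the second half
    top_a = np.argsort(scores_a)[-k:]    # best k sub-keys per half
    top_b = np.argsort(scores_b)[-k:]
    # Candidate expert (i, j) has combined score s_a[i] + s_b[j];
    # only k*k candidates are ever materialized.
    cand = [(scores_a[i] + scores_b[j], (int(i), int(j)))
            for i in top_a for j in top_b]
    cand.sort(reverse=True)
    return [idx for _, idx in cand[:k]]  # k (i, j) expert coordinates

rng = np.random.default_rng(0)
N, d = 16, 8                             # 16 sub-keys per half -> 256 experts
experts = product_key_topk(rng.normal(size=2 * d),
                           rng.normal(size=(N, d)),
                           rng.normal(size=(N, d)), k=4)
# experts is a list of four (i, j) coordinates into the 16x16 expert grid
```

The (i, j) coordinates can then serve as DHT keys, which is what makes this gating function compatible with decentralized lookup.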
Strengths: * This paper addresses an important problem for scaling and suggests a path forward using the compute available in consumer-grade PCs connected via the internet. This is an ambitious and important problem. * The early empirical results demonstrate the efficacy of Learning@home versus model-parallel systems, and the graceful degradation of MoE layers under high latency is encouraging.
Weaknesses: * Measuring the performance and tradeoffs of this system against established performance benchmarks would be necessary. This work supports that the network converges, but it is not clear what performance trade-off exists with this new architecture and asynchronous updates. For instance, does this architecture and training regime facilitate competitive natural language processing or computer vision performance? The only performance measure currently benchmarked is that this architecture/system trains MNIST models that exceed 95% validation accuracy. * The related work extends nearly three pages. The experimental section, arguably the most important for empirical and engineering research, is squeezed into a page. This is well-motivated work, but more empirical testing and further details of the infrastructure are needed. * There is of course no demand that the authors demonstrate the full potential of an idea ("However, reaching the full potential of this idea is a monumental task well outside the restrictions of one publication."), but I'd generally recommend withholding potentially subjective judgments of what is or is not within the scope of a publication.
Correctness: All claims and methods appear reasonable and correct.
Clarity: * The approach is clearly delineated, but I might suggest to the authors a less conversational tone in scientific writing (L49: “all that power”, L50: “way slower”). This is of course a matter of style, but some readers may find it distracting. * As one recommendation, aim to strengthen your conclusion. You’ve laid out an important and interesting research direction, but the conclusion does this paper a disservice. Don’t squeeze it in at the bottom as an afterthought. * I recommend a separate dedicated section for the training infrastructure, one of the primary contributions. * What is a block architecture? The model used for MNIST digit recognition is unclear.
Relation to Prior Work: Volunteer computing over internet-connected servers and devices is a widespread technique, but the specific application to a Mixture-of-Experts layer is novel as far as I am aware.
Additional Feedback: * Why does Learning@home have a higher throughput in examples/second than the model-parallel system at 0ms network delay? * How do existing deep learning frameworks fail to support the DMoE architecture? ============= Post review: Thank you for the additional experiments and revisions. I have raised my score accordingly.
Summary and Contributions: The work introduces the merger of ideas from distributed hashtables and mixture of experts with the goal of enabling volunteer computing in large scale ML.
Strengths: The paper is very well written. I did not find a single typo or odd formulation. This is very much appreciated. The core idea is solid. I really like the thinking down these lines, and how we can make ML more accessible to more researchers. Using spare CPU and GPU cycles is a great way to do that.
Weaknesses: The main motivation in the paper is a bit simplistic and inward-facing on the ML community. The impact statement expands on this and on the challenge that work *doesn't* get done unless it aligns with the interests of large organizations. I'd advise the authors to move some of that thinking into the main paper. The evaluation is weak, and the authors admit as much. Without deployment, it is near impossible to say whether this approach can work in practice. I am also skeptical of some of the assumptions (e.g. symmetric 100 Mbit/s connections), but have equally no empirical basis for that skepticism.
Correctness: Very hard to say. The arguments laid out are clear and believable. However, this being a distributed systems proposal on volunteer hardware, I have no way of knowing whether it actually works.
Clarity: Yes, very much so. Thank you!
Relation to Prior Work: Yes. However, I would encourage the authors to spend a few more lines on DHTs and their core mechanisms. The NeurIPS audience is generally not well versed in systems.
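To illustrate the kind of core mechanism I have in mind, here is a minimal consistent-hashing sketch of how a DHT maps keys to nodes. This is my own illustration with made-up node names, not anything from the paper:

```python
# Minimal consistent-hashing sketch (my own illustration, not the paper's code).
# Nodes and keys are hashed onto a ring; a key is owned by the first node
# clockwise from its position, so peers can locate data without a coordinator.
import hashlib
from bisect import bisect_right

def h(s: str) -> int:
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # each node owns the arc of the ring ending at its hash
        self.ring = sorted((h(n), n) for n in nodes)

    def lookup(self, key: str) -> str:
        # first node clockwise from the key's position (wrapping around)
        pos = bisect_right(self.ring, (h(key), ""))
        return self.ring[pos % len(self.ring)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.lookup("expert.42")
# the same key always resolves to the same node, and adding or removing a
# node only remaps keys on the adjacent arc
```

A few sentences along these lines (plus how real DHTs like Kademlia route lookups in O(log n) hops) would make the paper much more approachable for an ML audience.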
Additional Feedback: I wonder whether this would be better as a demo, followed by a deployment and then a paper. This would be infinitely more exciting to read if there was a section on practical experiences with the proposed system. Comments added after author feedback ============================== Thank you for making the time to address my concerns. My main concern remains that it is hard to know how a P2P system will perform until it is deployed. Hence, I cannot raise my score.
Summary and Contributions: This paper raises the question of what the solution would be if model and data sizes were 1000x larger than today's. It points out that volunteer computing could be the right direction by proposing a framework/system to train large neural networks on volunteer hardware. That hardware usually consists of modest devices such as regular PCs. Finally, it evaluates the prototype on several real tasks.
Strengths: 1. Although making use of modest individual devices for distributed deep learning model training is not a new idea, building a working prototype or system is a real contribution. 2. Also, although the idea of MoE has been around for a while, building this decentralized MoE for volunteer computing can be useful to the community from the systems perspective. 3. The paper is well-written. Specifically, I enjoy the flow (the intro is very interesting, and I appreciate the efforts in related work to make the paper more fluent), and every point it makes is crystal clear. In general, the major reasons why I like this paper are that 1) machine learning or deep learning at scale will someday absolutely become a major challenge, and this paper provides a good angle to look at this problem, and 2) there are many ideas/advances in DL every day, but not all of them provide a working system. But also because of this, I didn't give a high score. Details in weaknesses. --------------UPDATE-------------- To clarify what the author feedback states: I'm not claiming that distributed training is a new idea; I'm saying that Mixture of Experts (MoE) has been around and has shown advantages in large-scale training for a while.
Weaknesses: 1. To me, the major drawback of the paper is its evaluation. After reading the inspiring introduction, I was expecting a billion- or trillion-scale evaluation and surprising numbers in the end, because that was the whole motivation. It is very disappointing to see such a weak evaluation in a paper with such a good start. 2. While I understand the importance of the first two sections, the intro and background sections span over four pages. I recommend shrinking these sections and adding more evaluations. I see the authors state at the beginning of the evaluation that the setup is kept toy-scale for easy reproducibility. If the authors could provide further large-scale experiments/evaluations, I will raise my score. Otherwise, it is hard to convince audiences when the paper claims to solve a trillion-scale problem but the evaluation is on MNIST. ---------------Update------------------- I'm raising my score because the authors have shown a relatively larger-scale deployment with Transformer-XL and WikiText-2. (The dataset is not that large, but it is much better than MNIST.) But the main reason is still that I think this direction needs some support and encouragement. I hope that, whether or not this paper gets accepted, the authors can keep up such practical research and the friendly code-base.
Relation to Prior Work: Yes.