FlowMoE: A Scalable Pipeline Scheduling Framework for Distributed Mixture-of-Experts Training

Yunqi Gao, Bing Hu, Boloursaz Mashhadi, A-Long Jin, Yanfeng Zhang, Pei Xiao, Rahim Tafazolli, Merouane Debbah

Advances in Neural Information Processing Systems 38 (NeurIPS 2025) Main Conference Track

The parameter size of modern large language models (LLMs) can be scaled to the trillion level via the sparsely-activated Mixture-of-Experts (MoE) technique while avoiding an excessive increase in computational cost. To further improve training efficiency, pipelining computation and communication has become a promising solution for distributed MoE training. However, existing work primarily focuses on scheduling tasks within the MoE layer, such as expert computing and all-to-all (A2A) communication, while neglecting other key operations including multi-head attention (MHA) computing, gating, and all-reduce communication. In this paper, we propose FlowMoE, a scalable framework for scheduling multi-type task pipelines. First, FlowMoE constructs a unified pipeline to consistently schedule MHA computing, gating, expert computing, and A2A communication. Second, FlowMoE introduces a tensor chunk-based priority scheduling mechanism to overlap all-reduce communication with all computing tasks. We implement FlowMoE as an adaptive and generic framework atop PyTorch. Extensive experiments with 675 typical MoE layers and four real-world MoE models across two GPU clusters demonstrate that our proposed FlowMoE framework outperforms state-of-the-art MoE training frameworks, reducing training time by 14%-57%, energy consumption by 10%-39%, and memory usage by 7%-32%. FlowMoE’s code is anonymously available at https://anonymous.4open.science/r/FlowMoE.
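
To illustrate the general idea of chunk-based overlap of all-reduce communication with computation, the sketch below splits a gradient tensor into chunks and launches asynchronous all-reduce operations in PyTorch while other work proceeds. This is a minimal illustration only, not FlowMoE's scheduler: it omits the priority mechanism and the unified multi-type pipeline, and the function and parameter names (`chunked_all_reduce_overlap`, `compute_steps`, `num_chunks`) are assumptions for this example.

```python
# Minimal sketch of chunked all-reduce overlapped with computation.
# Requires an initialized process group, e.g.:
#   torch.distributed.init_process_group(backend="nccl")
import torch
import torch.distributed as dist


def chunked_all_reduce_overlap(grad: torch.Tensor, compute_steps, num_chunks: int = 4):
    """Launch per-chunk async all-reduce, then run compute while the
    collectives are in flight. `compute_steps` is any iterable of callables
    representing computation that is independent of `grad`."""
    # Chunks along dim 0 of a contiguous tensor are contiguous views, so the
    # in-place all-reduce writes the reduced values back into `grad` directly.
    chunks = grad.chunk(num_chunks)
    handles = [dist.all_reduce(c, op=dist.ReduceOp.SUM, async_op=True) for c in chunks]

    # Overlap: computation runs while communication is in the background.
    for step in compute_steps:
        step()

    # Synchronize before the reduced gradients are consumed by the optimizer.
    for h in handles:
        h.wait()
    return grad
```

In a real training framework this kind of overlap is typically hooked into the backward pass (e.g., via gradient hooks) rather than called explicitly; the sketch only conveys the chunking-and-overlap principle described in the abstract.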