BML: A High-performance, Low-cost Gradient Synchronization Algorithm for DML Training

Part of Advances in Neural Information Processing Systems 31 (NeurIPS 2018)


Authors

Songtao Wang, Dan Li, Yang Cheng, Jinkun Geng, Yanshu Wang, Shuai Wang, Shu-Tao Xia, Jianping Wu

Abstract

In distributed machine learning (DML), the network performance between machines significantly impacts the speed of iterative training. In this paper we propose BML, a new gradient synchronization algorithm with higher network performance and lower network cost than current practice. BML runs on a BCube network instead of the traditional Fat-Tree topology. The BML algorithm is designed such that, compared to the parameter server (PS) algorithm on a Fat-Tree network connecting the same number of server machines, BML theoretically achieves 1/k of the gradient synchronization time while using only k/5 of the switches (typical values of k are 2∼4). Experiments with the LeNet-5 and VGG-19 benchmarks on a testbed of 9 dual-GPU servers show that BML reduces the job completion time of DML training by up to 56.4%.
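As a worked reading of the stated comparison (the notation T_PS, S_FT, T_BML, S_BML is introduced here for illustration and is not taken from the paper), the abstract's claim can be written as

\[
T_{\mathrm{BML}} \approx \frac{1}{k}\, T_{\mathrm{PS}}, \qquad
S_{\mathrm{BML}} = \frac{k}{5}\, S_{\mathrm{FT}},
\]

where T_PS and S_FT are the gradient synchronization time and switch count of the PS algorithm on Fat-Tree. For example, k = 2 would give half the synchronization time with 40% of the switches, and k = 4 would give a quarter of the synchronization time with 80% of the switches.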