The Effect of Network Width on the Performance of Large-batch Training

Part of Advances in Neural Information Processing Systems 31 (NeurIPS 2018)

Authors

Lingjiao Chen, Hongyi Wang, Jinman Zhao, Dimitris Papailiopoulos, Paraschos Koutris

Abstract

Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, caused by the high frequency of gradient updates inherent in small-batch training. Training with larger batches reduces these overheads, since fewer updates per epoch mean fewer gradient synchronizations; however, it can hurt both the convergence of the algorithm and its generalization performance.
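
As a rough illustration (not from the paper), the communication saving follows from simple counting: in synchronous data-parallel SGD, every mini-batch triggers one gradient synchronization, so an epoch over N examples costs roughly N / B synchronization rounds at batch size B. The dataset size and batch sizes in the sketch below are illustrative assumptions.

```python
# Back-of-the-envelope sketch: larger batches mean fewer gradient syncs per epoch.
N = 1_000_000                      # dataset size (illustrative)
for B in (64, 512, 4096):          # small to large batch sizes (illustrative)
    print(f"batch size {B:>5}: ~{N // B} gradient synchronizations per epoch")
```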

In this work, we take a first step towards analyzing how the structure (width and depth) of a neural network affects the performance of large-batch training. We present new theoretical results which suggest that, for a fixed number of parameters, wider networks are more amenable to fast large-batch training than deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained with larger batches without incurring a convergence slowdown, unlike their deeper variants.
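
A minimal sketch (not the paper's experimental code) of the kind of comparison the abstract describes: a "wide" and a "deep" fully-connected network with roughly matched parameter counts, trained with large-batch SGD. All dimensions, the synthetic data, and the hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden_dim, depth, out_dim):
    """Fully-connected network with `depth` hidden layers of width `hidden_dim`."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden_dim), nn.ReLU()]
        d = hidden_dim
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

# Widths/depths chosen so the two models have roughly the same parameter count.
wide = mlp(in_dim=784, hidden_dim=512, depth=2,  out_dim=10)   # few, wide layers
deep = mlp(in_dim=784, hidden_dim=187, depth=16, out_dim=10)   # many, narrow layers

n_params = lambda m: sum(p.numel() for p in m.parameters())
print("wide:", n_params(wide), "deep:", n_params(deep))        # comparable totals

# Large-batch SGD on synthetic data: fewer updates (and, when distributed,
# fewer gradient synchronizations) per epoch than small-batch training.
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(8192, 784),
                                   torch.randint(0, 10, (8192,))),
    batch_size=2048, shuffle=True)                             # "large" batch
opt = torch.optim.SGD(wide.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
for x, y in loader:
    opt.zero_grad()
    loss_fn(wide(x), y).backward()
    opt.step()
```

The same training loop can be run on `deep` with an identical batch size to compare how quickly the two architectures converge under large-batch training.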