{"title": "Understanding Dropout", "book": "Advances in Neural Information Processing Systems", "page_first": 2814, "page_last": 2822, "abstract": "Dropout is a relatively new algorithm for training neural networks which relies on stochastically dropping out'' neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. We also show in simple cases how dropout performs stochastic gradient descent on a regularized error function.\"", "full_text": "Understanding Dropout\n\nPierre Baldi\n\nDepartment of Computer Science\nUniversity of California, Irvine\n\nIrvine, CA 92697\n\npfbaldi@uci.edu\n\nPeter Sadowski\n\nDepartment of Computer Science\nUniversity of California, Irvine\n\nIrvine, CA 92697\n\npjsadows@ics.uci.edu\n\nAbstract\n\nDropout is a relatively new algorithm for training neural networks which relies\non stochastically \u201cdropping out\u201d neurons during training in order to avoid the\nco-adaptation of feature detectors. We introduce a general formalism for study-\ning dropout on either units or connections, with arbitrary probability values, and\nuse it to analyze the averaging and regularizing properties of dropout in both lin-\near and non-linear networks. For deep neural networks, the averaging properties\nof dropout are characterized by three recursive equations, including the approx-\nimation of expectations by normalized weighted geometric means. 
We provide estimates and bounds for these approximations and corroborate the results with simulations. Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function.\n\n1 Introduction\n\nDropout is an algorithm for training neural networks that was described at NIPS 2012 [7]. In its simplest form, during training, at each example presentation, feature detectors are deleted with probability q = 1 \u2212 p = 0.5 and the remaining weights are trained by backpropagation. All weights are shared across all example presentations. During prediction, the weights are divided by two. The main motivation behind the algorithm is to prevent the co-adaptation of feature detectors, or overfitting, by forcing neurons to be robust and to rely on population behavior rather than on the activity of other specific units. In [7], dropout is reported to achieve state-of-the-art performance on several benchmark datasets. It is also noted that for a single logistic unit dropout performs a kind of \u201cgeometric averaging\u201d over the ensemble of possible subnetworks, and it is conjectured that something similar may also occur in multilayer networks, leading to the view that dropout may be an economical approximation to training and using a very large ensemble of networks.\nIn spite of the impressive results that have been reported, little is known about dropout from a theoretical standpoint, in particular about its averaging, regularization, and convergence properties. Likewise, little is known about the importance of using q = 0.5, whether different values of q can be used, including different values for different layers or different units, and whether dropout can be applied to the connections rather than the units. 
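The training and prediction rules described above can be sketched in a few lines for a single linear unit; the following is a minimal pure-Python illustration (function names and the toy values are ours, not from the paper):

```python
import random

def dropout_forward_train(x, w, q=0.5):
    # One stochastic training pass: each input feature is independently
    # dropped with probability q = 1 - p.
    mask = [0.0 if random.random() < q else 1.0 for _ in x]
    return sum(wi * mi * xi for wi, mi, xi in zip(w, mask, x))

def forward_predict(x, w, p=0.5):
    # At prediction time no units are dropped; instead the weights are
    # scaled by p (with p = 0.5 this is "dividing the weights by two").
    return sum(wi * p * xi for wi, xi in zip(w, x))
```

For a linear unit, averaging the output of many stochastic training passes approaches the deterministic prediction-time output, which is the averaging property analyzed in the next section.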
Here we address these questions.\n\n2 Dropout in Linear Networks\n\nIt is instructive to first look at some of the properties of dropout in linear networks, since these can be studied exactly in the most general setting of a multilayer feedforward network described by an underlying acyclic graph. The activity in unit i of layer h can be expressed as:\n\nS^h_i(I) = \u03a3_{l<h} \u03a3_j w^{hl}_{ij} S^l_j, with S^0_j = I_j (1)\n\nTaking expectations over the dropout variables gives the ensemble average, which satisfies the recursion:\n\nE(S^h_i) = \u03a3_{l<h} \u03a3_j w^{hl}_{ij} p^l_j E(S^l_j) for h > 0 (4)\n\nwhere p^l_j is the probability that unit j in layer l is not dropped, and with E(S^0_j) = I_j in the input layer. In short, the ensemble average can easily be computed by feedforward propagation in the original network, simply replacing the weights w^{hl}_{ij} by w^{hl}_{ij} p^l_j.
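As a sanity check on this weight-rescaling property, the following pure-Python sketch (illustrative, not the authors' code) enumerates every dropout mask for one linear layer and confirms that the exact ensemble average equals a single forward pass with each weight w_ij scaled by p_j:

```python
from itertools import product

def layer(x, w):
    # One linear layer: w[i][j] connects input j to output i (toy sizes, hypothetical values).
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def exact_dropout_mean(x, w, p):
    # Exact ensemble average: sum over all 2^n dropout masks on the input
    # units, weighting each subnetwork's output by its probability.
    mean = [0.0] * len(w)
    for mask in product([0, 1], repeat=len(x)):
        prob = 1.0
        for mj, pj in zip(mask, p):
            prob *= pj if mj else 1.0 - pj
        out = layer([mj * xj for mj, xj in zip(mask, x)], w)
        mean = [m + prob * o for m, o in zip(mean, out)]
    return mean

def scaled_forward(x, w, p):
    # Single deterministic pass with each weight w[i][j] replaced by
    # w[i][j] * p[j], as in the ensemble-average recursion.
    return layer([pj * xj for pj, xj in zip(p, x)], w)
```

By linearity of expectation the two quantities agree exactly, not just approximately, which is why the linear case can be analyzed in closed form.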