{"title": "Swapout: Learning an ensemble of deep architectures", "book": "Advances in Neural Information Processing Systems", "page_first": 28, "page_last": 36, "abstract": "We describe Swapout, a new stochastic training method that outperforms ResNets of identical network structure, yielding impressive results on CIFAR-10 and CIFAR-100. Swapout samples from a rich set of architectures including dropout, stochastic depth and residual architectures as special cases. When viewed as a regularization method, swapout not only inhibits co-adaptation of units in a layer, similar to dropout, but also across network layers. We conjecture that swapout achieves strong regularization by implicitly tying the parameters across layers. When viewed as an ensemble training method, it samples a much richer set of architectures than existing methods such as dropout or stochastic depth. We propose a parameterization that reveals connections to existing architectures and suggests a much richer set of architectures to be explored. We show that our formulation suggests an ef\ufb01cient training method and validate our conclusions on CIFAR-10 and CIFAR-100, matching state-of-the-art accuracy. Remarkably, our 32 layer wider model performs similarly to a 1001 layer ResNet model.", "full_text": "Swapout: Learning an ensemble of deep architectures\n\nSaurabh Singh, Derek Hoiem, David Forsyth\n\nDepartment of Computer Science\n\nUniversity of Illinois, Urbana-Champaign\n{ss1, dhoiem, daf}@illinois.edu\n\nAbstract\n\nWe describe Swapout, a new stochastic training method that outperforms ResNets\nof identical network structure, yielding impressive results on CIFAR-10 and CIFAR-\n100. Swapout samples from a rich set of architectures including dropout [20],\nstochastic depth [7] and residual architectures [5, 6] as special cases. When viewed\nas a regularization method, swapout not only inhibits co-adaptation of units in\na layer, similar to dropout, but also across network layers. 
We conjecture that\nswapout achieves strong regularization by implicitly tying the parameters across\nlayers. When viewed as an ensemble training method, it samples a much richer\nset of architectures than existing methods such as dropout or stochastic depth.\nWe propose a parameterization that reveals connections to existing architectures\nand suggests a much richer set of architectures to be explored. We show that our\nformulation suggests an ef\ufb01cient training method and validate our conclusions on\nCIFAR-10 and CIFAR-100, matching state-of-the-art accuracy. Remarkably, our 32\nlayer wider model performs similarly to a 1001 layer ResNet model.\n\n1\n\nIntroduction\n\nThis paper describes swapout, a stochastic training method for general deep networks. Swapout\nis a generalization of the dropout [20] and stochastic depth [7] methods. Dropout zeros the output of\nindividual units at random during training, while stochastic depth skips entire layers at random during\ntraining. In comparison, the most general swapout network produces the value of each output unit\nindependently by reporting the sum of a randomly selected subset of current and all previous layer\noutputs for that unit. As a result, while some units in a layer may act like normal feedforward units,\nothers may produce skip connections and yet others may produce a sum of several earlier outputs. In\neffect, our method averages over a very large set of architectures that includes all architectures used\nby dropout and all used by stochastic depth.\nOur experimental work focuses on a version of swapout which is a natural generalization of the\nresidual network [5, 6]. We show that this results in improvements in accuracy over residual networks\nwith the same number of layers.\nImprovements in accuracy are often sought by increasing the depth, leading to serious practical\ndif\ufb01culties. 
The number of parameters rises sharply, although recent works such as [19, 22] have\naddressed this by reducing the \ufb01lter size. Another issue resulting from increased depth is\nthe dif\ufb01culty of training longer chains of dependent variables. Such dif\ufb01culties have been addressed\nby architectural innovations that introduce shorter paths from input to loss, either directly [22, 21, 5]\nor with additional losses applied to intermediate layers [22, 12]. At the time of writing, the deepest\nnetworks that have been successfully trained are residual networks (1001 layers [6]). We show that\nincreasing the depth of our swapout networks increases their accuracy.\nThere is compelling experimental evidence that these very large depths are helpful, though this may\nbe because architectural innovations introduced to make networks trainable reduce the capacity of\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fthe layers. The theoretical evidence that a depth of 1000 is required for practical problems is thin.\nBengio and Delalleau argue that circuit ef\ufb01ciency constraints suggest increasing depth is important,\nbecause there are functions that require exponentially large shallow networks to compute [1]. Less\nexperimental interest has been displayed in the width of the networks (the number of \ufb01lters in a\nconvolutional layer). We show that increasing the width of our swapout networks leads to signi\ufb01cant\nimprovements in their accuracy; an appropriately wide swapout network is competitive with a deep\nresidual network that is 1.5 orders of magnitude deeper and has more parameters.
Wider but much\nshallower swapout networks are competitive with very deep residual networks.\n\n2 Related Work\n\nConvolutional neural networks have a long history (see the introduction of [11]). They are now\nintensively studied as a result of recent successes (e.g. [9]). Increasing the number of layers in\na network improves performance [19, 22] if the network can be trained. A variety of signi\ufb01cant\narchitectural innovations improve trainability, including: the ReLU [14, 3]; batch normalization [8];\nand allowing signals to skip layers.\nOur method exploits this skipping process. Highway networks use gated skip connections to allow\ninformation and gradients to pass unimpeded across several layers [21]. Residual networks use\nidentity skip connections to further improve training [5]; extremely deep residual networks can be\ntrained, and perform well [6]. In contrast to these architectures, our method skips at the unit level\n(below), and does so randomly.\nOur method employs randomness at training time. For a review of the history of random methods,\nsee the introduction of [16], which shows that entirely randomly chosen features can produce an\nSVM that generalizes well. Randomly dropping out unit values (dropout [20]) discourages co-\nadaptation between units. Randomly skipping layers (stochastic depth) [7] during training reliably\nleads to improvements at test time, likely because doing so regularizes the network. The precise\ndetails of the regularization remain uncertain, but it appears that stochastic depth represents a form\nof tying between layers; when a layer is dropped, other layers are encouraged to be able to replace\nit. Each method can be seen as training a network that averages over a family of architectures\nduring inference. Dropout averages over architectures with \u201cmissing\u201d units and stochastic depth\naverages over architectures with \u201cmissing\u201d layers. 
Other successful recent randomized methods\ninclude dropconnect [23] which generalizes dropout by dropping individual connections instead of\nunits (so dropping several connections together), and stochastic pooling [24] (which regularizes by\nreplacing the deterministic pooling by randomized pooling). In contrast, our method skips layers\nrandomly at a unit level enjoying the bene\ufb01ts of each method.\nRecent results show that (a) stochastic gradient descent with suf\ufb01ciently few steps is stable (in the\nsense that changes to training data do not unreasonably disrupt predictions) and (b) dropout enhances\nthat property, by reducing the value of a Lipschitz constant ([4], Lemma 4.4). We show our method\nenjoys the same behavior as dropout in this framework.\nLike dropout, the network trained with swapout depends on random variables. A reasonable strategy\nat test time with such a network is to evaluate multiple instances (with different samples used for\nthe random variables) and average. Reliable improvements in accuracy are achievable by training\ndistinct models (which have distinct sets of parameters), then averaging predictions [22], thereby\nforming an explicit ensemble. In contrast, each of the instances of our network in an average would\ndraw from the same set of parameters (we call this an implicit ensemble). Srivastava et al. argue\nthat, at test time, random values in a dropout network should be replaced with expectations, rather\nthan taking an average over multiple instances [20] (though they use explicit ensembles, increasing\nthe computational cost). Considerations include runtime at test; the number of samples required;\nvariance; and experimental accuracy results. For our model, accurate values of these expectations are\nnot available. 
In Section 4, we show that (a) swapout networks that use estimates of these expectations\noutperform strong comparable baselines and (b) in turn, these are outperformed by swapout networks\nthat use an implicit ensemble.\n\n2\n\n\fFigure 1: Visualization of architectural differences, showing computations for a block using various\narchitectures. Each circle is a unit in a grid corresponding to spatial layout, and circles are colored to\nindicate what they report. Given input X (a), all units in a feed forward block emit F (X) (b). All\nunits in a residual network block emit X + F (X) (c). A skipforward network randomly chooses\nbetween reporting X and F (X) per unit (d). Finally, swapout randomly chooses between reporting\n0 (and so dropping out the unit), X (skipping the unit), F (X) (imitating a feedforward network at\nthe unit) and X + F (X) (imitating a residual network unit).\n\n3 Swapout\nNotation and terminology: We use capital letters to represent tensors and \u2299 to represent element-wise\nproduct (broadcasted for scalars). We use boldface 0 and 1 to represent tensors of 0 and\n1 respectively. A network block is a set of simple layers in some speci\ufb01c con\ufb01guration, e.g. a\nconvolution followed by a ReLU or a residual network block [5]. Several such potentially different\nblocks can be connected in the form of a directed acyclic graph to form the full network model.\nDropout kills individual units randomly; stochastic depth skips entire blocks of units randomly.\nSwapout allows individual units to be dropped, or to skip blocks randomly. Implementing swapout is\na straightforward generalization of dropout. Let X be the input to some network block that computes\nF (X). The u\u2019th unit produces F (u)(X) as output. Let \u0398 be a tensor of i.i.d. Bernoulli random\nvariables. 
Dropout computes the output Y of that block as\n\nY = \u0398 \u2299 F (X).\n\n(1)\n\nIt is natural to think of dropout as randomly selecting an output from the set F (u) = {0, F (u)(X)}\nfor the u\u2019th unit.\nSwapout generalizes dropout by expanding the choice of F (u). Now write {\u0398i} for N distinct tensors\nof i.i.d. Bernoulli random variables indexed by i and with corresponding parameters {\u03b8i}. Let {Fi} be\ncorresponding tensors consisting of values already computed somewhere in the network. Note that\none of these Fi can be X itself (identity). However, Fi are not restricted to being a function of X and\nwe drop the X to indicate this. Most natural choices for Fi are the outputs of earlier layers. Swapout\ncomputes the output of the layer in question by computing\n\nY = \u2211Ni=1 \u0398i \u2299 Fi\n\n(2)\n\nand so, for unit u, we have F (u) = {F (u)1, F (u)2, . . . , \u2211i F (u)i}. We study the\nsimplest case where\n\nY = \u03981 \u2299 X + \u03982 \u2299 F (X)\n\n(3)\n\nso that, for unit u, we have F (u) = {0, X (u), F (u)(X), X (u) + F (u)(X)}. Thus, each unit in the\nlayer could be:\n\n1) dropped (choose 0);\n2) a feedforward unit (choose F (u)(X));\n3) skipped (choose X (u));\n4) or a residual network unit (choose X (u) + F (u)(X)).\n\n3\n\n\fSince a swapout network can clearly imitate a residual network, and since residual networks are\ncurrently the best-performing networks on various standard benchmarks, we perform exhaustive\nexperimental comparisons with them.\nIf one accepts the view of dropout and stochastic depth as averaging over a set of architectures, then\nswapout extends the set of architectures used. 
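As an illustration, the per-unit sampling of equation 3 can be sketched in a few lines of NumPy. This is a hypothetical minimal example, not the authors' implementation; `F_X` stands in for a block output F (X) that would come from a real network:

```python
import numpy as np

def swapout(X, F_X, theta1=0.5, theta2=0.5, rng=None):
    """Sample Y = Theta1 * X + Theta2 * F(X) elementwise (equation 3).

    Each unit independently becomes 0 (dropped), X (skipped),
    F(X) (feedforward) or X + F(X) (residual)."""
    rng = rng or np.random.default_rng(0)
    Theta1 = (rng.random(X.shape) < theta1).astype(X.dtype)  # i.i.d. Bernoulli(theta1)
    Theta2 = (rng.random(X.shape) < theta2).astype(X.dtype)  # i.i.d. Bernoulli(theta2)
    return Theta1 * X + Theta2 * F_X

X = np.ones((4, 4))
Y = swapout(X, 2.0 * X)  # pretend F(X) = 2X, purely for illustration
# every unit lands in {0, 1, 2, 3}: dropped, skipped, feedforward, residual
assert set(np.unique(Y)) <= {0.0, 1.0, 2.0, 3.0}
```

In this parameterization, dropout corresponds to theta1 = 0, a plain residual block to theta1 = theta2 = 1, and stochastic depth to sharing a single Bernoulli draw across the whole tensor.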
Appropriate random choices of \u03981 and \u03982 yield: all\narchitectures covered by dropout; all architectures covered by stochastic depth; and block level skip\nconnections. But other choices yield unit level skip and residual connections.\nSwapout retains important properties of dropout. Swapout discourages co-adaptation by dropping\nunits, but also by on occasion presenting units with inputs that have come from earlier layers. Dropout\nhas been shown to enhance the stability of stochastic gradient descent ([4], lemma 4.4). This applies\nto swapout in its most general form, too. We extend the notation of that paper, and write L for\na Lipschitz constant that applies to the network, \u2207f (v) for the gradient of the network f with\nparameters v, and D\u2207f (v) for the gradient of the dropped out version of the network.\nThe crucial point in the relevant enabling lemma is that E[|| Df (v)||] < E[||\u2207f (v)||] \u2264 L (the\ninequality implies improvements). Now write \u2207S [f ] (v) for the gradient of a swapout network, and\n\u2207G [f ] (v) for the gradient of the swapout network which achieves the largest Lipschitz constant by\nchoice of \u0398i (this exists, because \u0398i is discrete). First, a Lipschitz constant applies to this network;\nsecond, E[||\u2207S [f ] (v)||] \u2264 E[||\u2207G [f ] (v)||] \u2264 L, so swapout makes stability no worse; third, we\nspeculate light conditions on f should provide E[||\u2207S [f ] (v)||] < E[||\u2207G [f ] (v)||] \u2264 L, improving\nstability ([4] Section 4).\n\n3.1\n\nInference in Stochastic Networks\n\nA model trained with swapout represents an entire family of networks with tied parameters, where\nmembers of the family were sampled randomly during training. There are two options for inference.\nEither replace random variables with their expected values, as recommended by Srivastava et al. [20]\n(deterministic inference). 
Alternatively, sample several members of the family at random, and average\ntheir predictions (stochastic inference). Note that such stochastic inference with dropout has been\nstudied in [2].\nThere is an important difference between swapout and dropout. In a dropout network, one can\nestimate expectations exactly (as long as the network isn\u2019t trained with batch normalization; see below).\nThis is because E[ReLU[\u0398 \u2299 F (X)]] = ReLU[E[\u0398 \u2299 F (X)]] (recall \u0398 is a tensor of Bernoulli\nrandom variables, and thus non-negative).\nIn a swapout network, one usually cannot estimate expectations exactly. The problem is that\nE[ReLU[(\u03981X + \u03982Y )]] is not the same as ReLU[E[(\u03981X + \u03982Y )]] in general. Estimates of\nexpectations that ignore this are successful, as the experiments show, but stochastic inference gives\nsigni\ufb01cantly better results.\nSrivastava et al. argue that deterministic inference is signi\ufb01cantly less expensive in computation.\nWe believe that Srivastava et al. may have overestimated how many samples are required for an\naccurate average, because they use distinct dropout networks in the average (Figure 11 in [20]).\nOur experience of stochastic inference with swapout has been positive, with the number of samples\nneeded for good behavior small (Figure 2). Furthermore, computational costs of inference are smaller\nwhen each instance of the network uses the same parameters.\nA technically more delicate point is that both dropout and swapout networks interact poorly with batch\nnormalization if one uses deterministic inference. The problem is that the estimates collected by batch\nnormalization during training may not re\ufb02ect test time statistics. To see this, consider two random\nvariables X and Y and let \u03981, \u03982 \u223c Bernoulli(\u03b8). 
While E[\u03981X + \u03982Y ] = E[\u03b8X + \u03b8Y ] =\n\u03b8X + \u03b8Y , it can be shown that Var[\u03981X + \u03982Y ] \u2265 Var[\u03b8X + \u03b8Y ] with equality holding only for\n\u03b8 = 0 and \u03b8 = 1. Thus, the variance estimates collected by Batch Normalization during training do\nnot represent the statistics observed during testing if the expected values of \u03981 and \u03982 are used in a\ndeterministic inference scheme. These errors in scale estimation accumulate as more and more layers\nare stacked. This may explain why [7] reports that dropout doesn\u2019t lead to any improvement when\nused in residual networks with batch normalization.\n\n4\n\n\f3.2 Baseline comparison methods\n\nResNets: We compare with ResNet architectures as described in [5] (referred to as v1) and\nin [6] (referred to as v2).\n\nDropout: Standard dropout on the output of the residual block, using Y = \u0398 \u2299 (X + F (X)).\n\nLayer Dropout: We replace equation 3 by Y = X + \u0398(1\u00d71)F (X). Here \u0398(1\u00d71) is a single\nBernoulli random variable shared across all units.\n\nSkipForward: Equation 3 introduces two stochastic parameters \u03981 and \u03982. We also explore a\nsimpler architecture, SkipForward, that introduces only one parameter but samples from a smaller set\nF (u) = {X (u), F (u)(X)} as below. A parallel work refers to this as zoneout [10].\n\nY = \u0398 \u2299 X + (1 \u2212 \u0398) \u2299 F (X)\n\n(4)\n\n4 Experiments\n\nWe experiment extensively on the CIFAR-10 dataset and demonstrate that a model trained with\nswapout outperforms a comparable ResNet model. Further, a 32 layer wider model matches the\nperformance of a 1001 layer ResNet on both the CIFAR-10 and CIFAR-100 datasets.\n\nModel: We experiment with ResNet architectures as described in [5] (referred to as v1) and\nin [6] (referred to as v2). 
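The variance mismatch behind the batch normalization issue discussed in Section 3.1 is easy to check empirically. The following standalone Monte Carlo sketch uses made-up values for X, Y and \u03b8, not anything from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 0.5, 200_000
x, y = 1.0, 2.0  # two fixed activation values

# Training-time combination Theta1*x + Theta2*y, Theta1, Theta2 ~ Bernoulli(theta)
t1 = (rng.random(n) < theta).astype(float)
t2 = (rng.random(n) < theta).astype(float)
samples = t1 * x + t2 * y

# The deterministic test-time value theta*x + theta*y is a constant (variance 0),
# while the training-time variance is theta*(1-theta)*(x**2 + y**2) = 1.25 here,
# so batch-norm scale statistics gathered during training are too large at test time.
assert abs(samples.mean() - theta * (x + y)) < 0.02
assert abs(samples.var() - theta * (1 - theta) * (x ** 2 + y ** 2)) < 0.02
```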
However, our implementation (referred to as ResNet Ours) has the following\nmodi\ufb01cations which improve the performance of the original model (Table 1). Between blocks of\ndifferent feature sizes we subsample using average pooling instead of strided convolutions and use\nprojection shortcuts with learned parameters. For \ufb01nal prediction we follow a scheme similar to Net-\nwork in Network [13]. We replace average pooling and fully connected layer by a 1 \u00d7 1 convolution\nlayer followed by global average pooling to predict the logits that are fed into the softmax.\nLayers in ResNets are arranged in three groups with all convolutional layers in a group containing\nequal number of \ufb01lters. We represent the number of \ufb01lters in each group as a tuple with the smallest\nsize as (16, 32, 64) (as used in [5]for CIFAR-10). We refer to this as width and experiment with\nvarious multiples of this base size represented as W \u00d7 1, W \u00d7 2 etc.\nTraining: We train using SGD with a batch size of 128, momentum of 0.9 and weight decay of\n0.0001. Unless otherwise speci\ufb01ed, we train all the models for a total 256 epochs. Starting from an\ninitial learning rate of 0.1, we drop it by a factor of 10 after 192 epochs and then again after 224\nepochs. Standard augmentation of left-right \ufb02ips and random translations of up to four pixels is used.\nFor translation, we pad the images by 4 pixels on all the sides and sample a random 32 \u00d7 32 crop.\nAll the images in a mini-batch use the same crop. Note that dropout slows convergence ([20], A.4),\nand swapout should do so too for similar reasons. Thus using the same training schedule for all the\nmethods should disadvantage swapout.\n\nModels trained with Swapout consistently outperform baselines: Table 1 compares Swapout\nwith various 20 layer baselines. 
Models trained with Swapout consistently outperform all other\nmodels of similar architecture.\n\nThe stochastic training schedule matters: Different layers in a swapout network could be trained\nwith different parameters of their Bernoulli distributions (the stochastic training schedule). Table 2\nshows that stochastic training schedules have a signi\ufb01cant effect on the performance. We report the\nperformance with deterministic as well as stochastic inference. These schedules differ in how the\nvalues of parameters \u03b81 and \u03b82 of the random variables in equation 3 are set for different layers. Note\nthat \u03b81 = \u03b82 = 0.5 corresponds to the maximum stochasticity. A schedule with less randomness in\nthe early layers (bottom row) performs the best because swapout adds per unit noise and early layers\nhave the largest number of units. Thus, low stochasticity in early layers signi\ufb01cantly reduces the\nrandomness in the system. We use this schedule for all the experiments unless otherwise stated.\n\n5\n\n\fTable 1: In comparison with fair baselines on CIFAR-10, swapout is always more accurate. We refer\nto the base width of (16, 32, 64) as W \u00d7 1 and others are multiples of it (See Table 3 for details on\nwidth). We report the width along with the number of parameters in each model. Models trained\nwith swapout consistently outperform all other models of comparable architecture. All stochastic\nmethods were trained using the Linear(1, 0.5) schedule (Table 2) and use stochastic inference. 
v1 and\nv2 represent residual block architectures in [5] and [6] respectively.\n\nMethod | Width | #Params | Error(%)\nResNet v1 [5] | W \u00d7 1 | 0.27M | 8.75\nResNet v1 Ours | W \u00d7 1 | 0.27M | 8.54\nSwapout v1 | W \u00d7 1 | 0.27M | 8.27\nResNet v2 Ours | W \u00d7 1 | 0.27M | 8.27\nSwapout v2 | W \u00d7 1 | 0.27M | 7.97\nSwapout v1 | W \u00d7 2 | 1.09M | 6.58\nResNet v2 Ours | W \u00d7 2 | 1.09M | 6.54\nStochastic Depth v2 Ours | W \u00d7 2 | 1.09M | 5.99\nDropout v2 | W \u00d7 2 | 1.09M | 5.87\nSkipForward v2 | W \u00d7 2 | 1.09M | 6.11\nSwapout v2 | W \u00d7 2 | 1.09M | 5.68\n\nTable 2: The choice of stochastic training schedule matters. We evaluate the performance of a 20\nlayer swapout model (W \u00d7 2) trained with different stochasticity schedules on CIFAR-10. These\nschedules differ in how the parameters \u03b81 and \u03b82 of the Bernoulli random variables in equation 3 are\nset for the different layers. Linear(a, b) refers to linear interpolation from a to b from the \ufb01rst block\nto the last (see [7]). Others use the same value for all the blocks. We report the performance for both\nthe deterministic and stochastic inference (with 30 samples). The schedule with less randomness in the\nearly layers (bottom row) performs the best.\n\nMethod | Deterministic Error(%) | Stochastic Error(%)\nSwapout (\u03b81 = \u03b82 = 0.5) | 10.36 | 6.69\nSwapout (\u03b81 = 0.2, \u03b82 = 0.8) | 10.14 | 7.63\nSwapout (\u03b81 = 0.8, \u03b82 = 0.2) | 7.58 | 6.56\nSwapout (\u03b81 = \u03b82 = Linear(0.5, 1)) | 7.34 | 6.52\nSwapout (\u03b81 = \u03b82 = Linear(1, 0.5)) | 6.43 | 5.68\n\nSwapout improves over ResNet architecture: From Table 3 it is evident that networks trained\nwith Swapout consistently show better performance than the corresponding ResNets, for most choices\nof width investigated, using just the deterministic inference. 
This difference indicates that the\nperformance improvement is not just an ensemble effect.\n\nStochastic inference outperforms deterministic inference: Table 3 shows that the stochastic\ninference scheme outperforms the deterministic scheme in all the experiments. Prediction for each\nimage is done by averaging the results of 30 stochastic forward passes. This difference is not just\ndue to the widely reported effect that an ensemble of networks is better as networks in our ensemble\nshare parameters. Instead, stochastic inference produces more accurate expectations and interacts\nbetter with batch normalization.\n\nStochastic inference needs few samples for a good estimate: Figure 2 shows the estimated\naccuracies as a function of the number of forward passes per image. It is evident that relatively few\nsamples are enough for a good estimate of the mean. Compare Figure-11 of [20], which implies \u223c 50\nsamples are required.\n\nIncrease in width leads to considerable performance improvements: The number of \ufb01lters in\na convolutional layer is its width. Table 3 shows that the performance of a 20 layer model improves\nconsiderably as the width is increased both for the baseline ResNet v2 architecture as well as\nthe models trained with Swapout. Swapout is better able to use the available capacity than the\n\n6\n\n\fTable 3: Wider swapout models work better. We evaluate the effect of increasing the number of \ufb01lters\non CIFAR-10. ResNets [5] contain three groups of layers with all convolutional layers in a group\ncontaining equal number of \ufb01lters. We indicate the number of \ufb01lters in each group as a tuple below\nand report the performance with deterministic as well as stochastic inference with 30 samples. 
For\neach size, the model trained with Swapout outperforms the corresponding ResNet model.\n\nModel | Width | #Params | ResNet v2 | Swapout (Deterministic) | Swapout (Stochastic)\nSwapout v2 (20) | W \u00d7 1 (16, 32, 64) | 0.27M | 8.27 | 8.58 | 7.92\nSwapout v2 (20) | W \u00d7 2 (32, 64, 128) | 1.09M | 6.54 | 6.40 | 5.68\nSwapout v2 (20) | W \u00d7 4 (64, 128, 256) | 4.33M | 5.62 | 5.43 | 5.09\nSwapout v2 (32) | W \u00d7 4 (64, 128, 256) | 7.43M | 5.23 | 4.97 | 4.76\n\nTable 4: Swapout outperforms comparable methods on CIFAR-10. A 32 layer wider model performs\ncompetitively against a 1001 layer ResNet. Swapout and dropout use stochastic inference.\n\nMethod | #Params | Error(%)\nDropConnect [23] | - | 9.32\nNIN [13] | - | 8.81\nFitNet(19) [17] | - | 8.39\nDSN [12] | - | 7.97\nHighway [21] | - | 7.60\nResNet v1(110) [5] | 1.7M | 6.41\nStochastic Depth v1(1202) [7] | 19.4M | 4.91\nSwapOut v1(20) W \u00d7 2 | 1.09M | 6.58\nResNet v2(1001) [6] | 10.2M | 4.92\nDropout v2(32) W \u00d7 4 | 7.43M | 4.83\nSwapOut v2(32) W \u00d7 4 | 7.43M | 4.76\n\ncorresponding ResNet with similar architecture and number of parameters. Table 4 compares models\ntrained with Swapout with other approaches on CIFAR-10 while Table 5 compares on CIFAR-100.\nOn both datasets our shallower but wider model compares well with the 1001 layer ResNet model.\n\nSwapout uses parameters ef\ufb01ciently: Persistently over tables 1, 3, and 4, swapout models with\nfewer parameters outperform other comparable models. For example, Swapout v2(32) W \u00d7 4 gets\n4.76% with 7.43M parameters in comparison to the ResNet version at 4.91% with 10.2M parameters.\n\nExperiments on CIFAR-100 con\ufb01rm our results: Table 5 shows that Swapout is very effective\nas it improves the performance of a 20 layer model (ResNet Ours) by more than 2%. Widening\nthe network and reducing the stochasticity leads to further improvements. 
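The stochastic inference used throughout these tables (averaging predicted probabilities over 30 stochastic forward passes) can be sketched as follows; `toy_model` is a hypothetical stand-in for a trained swapout network, not the actual model:

```python
import numpy as np

def stochastic_inference(model, x, n_samples=30, seed=0):
    """Average predicted class probabilities over several stochastic forward passes."""
    rng = np.random.default_rng(seed)
    return np.mean([model(x, rng) for _ in range(n_samples)], axis=0)

def toy_model(x, rng):
    # Swapout-style noise: keep each logit with probability 0.5, then softmax.
    mask = (rng.random(x.shape) < 0.5).astype(float)
    logits = mask * x
    e = np.exp(logits - logits.max())
    return e / e.sum()

p = stochastic_inference(toy_model, np.array([2.0, 1.0, 0.0]))
assert np.isclose(p.sum(), 1.0) and p.shape == (3,)
```

Because every forward pass reuses the same parameters, this is the implicit ensemble of Section 3.1 rather than an explicit one.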
Further, a wider but\nrelatively shallow model trained with Swapout (22.72%; 7.46M params) is competitive with the best\nperforming, very deep (1001 layer) latest ResNet model (22.71%;10.2M params).\n\n5 Discussion and future work\n\nSwapout is a stochastic training method that shows reliable improvements in performance and leads\nto networks that use parameters ef\ufb01ciently. Relatively shallow swapout networks give comparable\nperformance to extremely deep residual networks.\nPreliminary experiments on ImageNet [18] using swapout (Linear(1,0.8)) yield 28.7%/9.2% top-\n1/top-5 validation error while the corresponding ResNet-152 yields 22.4%/5.8% validation errors.\nWe noticed that stochasticity is a dif\ufb01cult hyper-parameter for deeper networks and a better setting\nwould likely improve results.\nWe have shown that different stochastic training schedules produce different behaviors, but have not\nsearched for the best schedule in any systematic way. It may be possible to obtain improvements by\ndoing so. We have described an extremely general swapout mechanism. It is straightforward using\n\n7\n\n\fTable 5: Swapout is strongly competitive with the best methods on CIFAR-100, and uses parameters\nef\ufb01ciently in comparison. A 20 layer model (Swapout v2 (20)) trained with Swapout improves upon\nthe corresponding 20 layer ResNet model (ResNet v2 Ours (20)). Further, a 32 layer wider model\nperforms competitively against a 1001 layer ResNet (last row). 
Swapout uses stochastic inference.\n\nMethod | #Params | Error(%)\nNIN [13] | - | 35.68\nDSN [12] | - | 34.57\nFitNet [17] | - | 35.04\nHighway [21] | - | 32.39\nResNet v1 (110) [5] | 1.7M | 27.22\nStochastic Depth v1 (110) [7] | 1.7M | 24.58\nResNet v2 (164) [6] | 1.7M | 24.33\nResNet v2 (1001) [6] | 10.2M | 22.71\nResNet v2 Ours (20) W \u00d7 2 | 1.09M | 28.08\nSwapOut v2 (20)(Linear(1,0.5)) W \u00d7 2 | 1.10M | 25.86\nSwapOut v2 (56)(Linear(1,0.5)) W \u00d7 2 | 3.43M | 24.86\nSwapOut v2 (56)(Linear(1,0.8)) W \u00d7 2 | 3.43M | 23.46\nSwapOut v2 (32)(Linear(1,0.8)) W \u00d7 4 | 7.46M | 22.72\n\n[Figure 2 plots: mean error rate (left) and its standard error (right) against the number of samples, for the schedules \u03b81 = \u03b82 = Linear(1, 0.5) and \u03b81 = \u03b82 = 0.5.]\n\nFigure 2: Stochastic inference needs few samples for a good estimate. We plot the mean error rate\non the left as a function of the number of samples for two stochastic training schedules. The standard\nerror of the mean is shown as the shaded interval on the left and magni\ufb01ed in the right plot. It is\nevident that relatively few samples are needed for a reliable estimate of the mean error. The mean\nand standard error were computed using 30 repetitions for each sample count. Note that stochastic\ninference overtakes deterministic inference accuracies within very few samples (2-3) (Table 2).\n\nequation 2 to apply swapout to inception networks [22] (by using several different functions of the\ninput and a suf\ufb01ciently general form of convolution); to recurrent convolutional networks [15] (by\nchoosing Fi to have the form F \u25e6 F \u25e6 F . . .); and to gated networks. All our experiments focus on\ncomparisons to residual networks because these are the current top performers on CIFAR-10 and\nCIFAR-100. 
It would be interesting to experiment with other versions of the method.\nAs with dropout and batch normalization, it is dif\ufb01cult to give a crisp explanation of why swapout\nworks. We believe that swapout causes some form of improvement in the optimization process. This\nis because relatively shallow networks with swapout reliably work as well as or better than quite\ndeep alternatives; and because swapout is notably and reliably more ef\ufb01cient in its use of parameters\nthan comparable deeper networks. Unlike dropout, swapout will often propagate gradients while\nstill forcing units not to co-adapt. Furthermore, our swapout networks involve some form of tying\nbetween layers. When a unit sometimes sees layer i and sometimes layer i \u2212 j, the gradient signal\nwill be exploited to encourage the two layers to behave similarly. The reason swapout is successful\nlikely involves both of these points.\n\nAcknowledgments: This work is supported in part by ONR MURI Awards N00014-10-1-0934 and N00014-16-1-2007. We would like to thank NVIDIA for donating some of the GPUs used in this work.\n\n8\n\n\fReferences\n[1] Y. Bengio and O. Delalleau. On the expressive power of deep architectures. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory, 2011.\n[2] Y. Gal and Z. Ghahramani. Bayesian convolutional neural networks with Bernoulli approximate variational inference. 2015.\n[3] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse recti\ufb01er neural networks. In AISTATS, 2011.\n[4] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. CoRR, abs/1509.01240, 2015.\n[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.\n[6] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016.\n[7] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. CoRR, abs/1603.09382, 2016.\n[8] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.\n[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classi\ufb01cation with deep convolutional neural networks. In NIPS, 2012.\n[10] D. Krueger, T. Maharaj, J. Kram\u00e1r, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, H. Larochelle, A. Courville, et al. Zoneout: Regularizing RNNs by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305, 2016.\n[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.\n[12] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. AISTATS, 2015.\n[13] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.\n[14] V. Nair and G. E. Hinton. Recti\ufb01ed linear units improve restricted Boltzmann machines. In ICML, 2010.\n[15] P. H. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene parsing. arXiv preprint arXiv:1306.2795, 2013.\n[16] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.\n[17] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. ICLR, 2015.\n[18] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015.\n[19] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.\n[20] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from over\ufb01tting. 
The Journal of Machine Learning Research, 2014.\n[21] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In NIPS, 2015.\n[22] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.\n[23] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using dropconnect. In ICML, pages 1058\u20131066, 2013.\n[24] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557, 2013.\n\n9\n\n\f", "award": [], "sourceid": 13, "authors": [{"given_name": "Saurabh", "family_name": "Singh", "institution": "UIUC"}, {"given_name": "Derek", "family_name": "Hoiem", "institution": "UIUC"}, {"given_name": "David", "family_name": "Forsyth", "institution": "UIUC"}]}