{"title": "Residual Networks Behave Like Ensembles of Relatively Shallow Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 550, "page_last": 558, "abstract": "In this work we propose a novel interpretation of residual networks showing that they can be seen as a collection of many paths of differing length. Moreover, residual networks seem to enable very deep networks by leveraging only the short paths during training. To support this observation, we rewrite residual networks as an explicit collection of paths. Unlike traditional models, paths through residual networks vary in length. Further, a lesion study reveals that these paths show ensemble-like behavior in the sense that they do not strongly depend on each other. Finally, and most surprising, most paths are shorter than one might expect, and only the short paths are needed during training, as longer paths do not contribute any gradient. For example, most of the gradient in a residual network with 110 layers comes from paths that are only 10-34 layers deep. Our results reveal one of the key characteristics that seem to enable the training of very deep networks: Residual networks avoid the vanishing gradient problem by introducing short paths which can carry gradient throughout the extent of very deep networks.", "full_text": "Residual Networks Behave Like Ensembles of\n\nRelatively Shallow Networks\n\nAndreas Veit\n\nMichael Wilber\n\nSerge Belongie\n\nDepartment of Computer Science & Cornell Tech\n\nCornell University\n\n{av443, mjw285, sjb344}@cornell.edu\n\nAbstract\n\nIn this work we propose a novel interpretation of residual networks showing that\nthey can be seen as a collection of many paths of differing length. Moreover,\nresidual networks seem to enable very deep networks by leveraging only the short\npaths during training. To support this observation, we rewrite residual networks as\nan explicit collection of paths. 
Unlike traditional models, paths through residual\nnetworks vary in length. Further, a lesion study reveals that these paths show\nensemble-like behavior in the sense that they do not strongly depend on each other.\nFinally, and most surprising, most paths are shorter than one might expect, and\nonly the short paths are needed during training, as longer paths do not contribute\nany gradient. For example, most of the gradient in a residual network with 110\nlayers comes from paths that are only 10-34 layers deep. Our results reveal one\nof the key characteristics that seem to enable the training of very deep networks:\nResidual networks avoid the vanishing gradient problem by introducing short paths\nwhich can carry gradient throughout the extent of very deep networks.\n\n1\n\nIntroduction\n\nMost modern computer vision systems follow a familiar architecture, processing inputs from low-\nlevel features up to task speci\ufb01c high-level features. Recently proposed residual networks [5, 6]\nchallenge this conventional view in three ways. First, they introduce identity skip-connections that\nbypass residual layers, allowing data to \ufb02ow from any layers directly to any subsequent layers. This\nis in stark contrast to the traditional strictly sequential pipeline. Second, skip connections give rise to\nnetworks that are two orders of magnitude deeper than previous models, with as many as 1202 layers.\nThis is contrary to architectures like AlexNet [13] and even biological systems [17] that can capture\ncomplex concepts within half a dozen layers.1 Third, in initial experiments, we observe that removing\nsingle layers from residual networks at test time does not noticeably affect their performance. This\nis surprising because removing a layer from a traditional architecture such as VGG [18] leads to a\ndramatic loss in performance.\nIn this work we investigate the impact of these differences. 
To address the in\ufb02uence of identity skip-\nconnections, we introduce the unraveled view. This novel representation shows residual networks\ncan be viewed as a collection of many paths instead of a single deep network. Further, the perceived\nresilience of residual networks raises the question whether the paths are dependent on each other or\nwhether they exhibit a degree of redundancy. To \ufb01nd out, we perform a lesion study. The results show\nensemble-like behavior in the sense that removing paths from residual networks by deleting layers or\ncorrupting paths by reordering layers only has a modest and smooth impact on performance. Finally,\nwe investigate the depth of residual networks. Unlike traditional models, paths through residual\nnetworks vary in length. The distribution of path lengths follows a binomial distribution, meaning\n\n1Making the common assumption that a layer in a neural network corresponds to a cortical area.\n\n\fthat the majority of paths in a network with 110 layers are only about 55 layers deep. Moreover, we\nshow most gradient during training comes from paths that are even shorter, i.e., 10-34 layers deep.\nThis reveals a tension. On the one hand, residual network performance improves with adding more\nand more layers [6]. However, on the other hand, residual networks can be seen as collections of\nmany paths and the only effective paths are relatively shallow. Our results could provide a \ufb01rst\nexplanation: residual networks do not resolve the vanishing gradient problem by preserving gradient\n\ufb02ow throughout the entire depth of the network. Rather, they enable very deep networks by shortening\nthe effective paths. 
For now, short paths still seem necessary to train very deep networks.\nIn this paper we make the following contributions:\n\n• We introduce the unraveled view, which illustrates that residual networks can be viewed as a collection of many paths, instead of a single ultra-deep network.\n• We perform a lesion study to show that these paths do not strongly depend on each other, even though they are trained jointly. Moreover, they exhibit ensemble-like behavior in the sense that their performance smoothly correlates with the number of valid paths.\n• We investigate the gradient flow through residual networks, revealing that only the short paths contribute gradient during training. Deep paths are not required during training.\n\n2 Related Work\n\nThe sequential and hierarchical computer vision pipeline Visual processing has long been understood to follow a hierarchical process from the analysis of simple to complex features. This formalism is based on the discovery of the receptive field [10], which characterizes the visual system as a hierarchical and feedforward system. Neurons in early visual areas have small receptive fields and are sensitive to basic visual features, e.g., edges and bars. Neurons in deeper layers of the hierarchy capture basic shapes, and even deeper neurons respond to full objects. This organization has been widely adopted in the computer vision and machine learning literature, from early neural networks such as the Neocognitron [4] and the traditional hand-crafted feature pipeline of Malik and Perona [15] to convolutional neural networks [13, 14]. The recent strong results of very deep neural networks [18, 20] led to the general perception that it is the depth of neural networks that governs their expressive power and performance. 
In this work, we show that residual networks do not necessarily follow this tradition.\nResidual networks [5, 6] are neural networks in which each layer consists of a residual module fi and a skip connection2 bypassing fi. Since layers in residual networks can comprise multiple convolutional layers, we refer to them as residual blocks in the remainder of this paper. For clarity of notation, we omit the initial pre-processing and final classification steps. With yi−1 as its input, the output of the ith block is recursively defined as\n\nyi ≡ fi(yi−1) + yi−1,\n\n(1)\n\nwhere fi(x) is some sequence of convolutions, batch normalization [11], and Rectified Linear Units (ReLU) as nonlinearities. Figure 1 (a) shows a schematic view of this architecture. In the most recent formulation of residual networks [6], fi(x) is defined by\n\nfi(x) ≡ Wi · σ(B(W′i · σ(B(x)))),\n\n(2)\n\nwhere Wi and W′i are weight matrices, · denotes convolution, B(x) is batch normalization and σ(x) ≡ max(x, 0). Other formulations are typically composed of the same operations, but may differ in their order.\nThe idea of branching paths in neural networks is not new. For example, in the regime of convolutional neural networks, models based on inception modules [20] were among the first to arrange layers in blocks with parallel paths rather than a strict sequential order. 
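Equations (1) and (2) can be sketched in a few lines. The sketch below substitutes matrix multiplications for convolutions and omits the learned scale and shift of batch normalization; both are simplifications of ours, not the paper's setup.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Simplified B(x): standardize over the batch axis
    # (no learned scale/shift, unlike full batch normalization).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def residual_block(x, W, W_prime):
    # Equation (2), pre-activation order: f(x) = W · σ(B(W' · σ(B(x)))),
    # with matrix products standing in for convolutions.
    relu = lambda z: np.maximum(z, 0.0)
    f = relu(batch_norm(x)) @ W_prime
    f = relu(batch_norm(f)) @ W
    # Equation (1): y_i = f_i(y_{i-1}) + y_{i-1}
    return x + f

rng = np.random.default_rng(0)
y = rng.standard_normal((8, 16))            # batch of 8, width 16
for _ in range(3):                          # three residual blocks
    W, W_prime = rng.standard_normal((2, 16, 16)) * 0.1
    y = residual_block(y, W, W_prime)
print(y.shape)
```

The skip connection is just the `x +` term; everything else is the residual module f.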
We choose residual networks for this study because of their simple design principle.\nHighway networks Residual networks can be viewed as a special case of highway networks [19]. The output of each layer of a highway network is defined as\n\nyi+1 ≡ fi+1(yi) · ti+1(yi) + yi · (1 − ti+1(yi))\n\n(3)\n\n2We only consider identity skip connections, but this framework readily generalizes to more complex projection skip connections when downsampling is required.\n\n(a) Conventional 3-block residual network\n\n(b) Unraveled view of (a)\n\nFigure 1: Residual Networks are conventionally shown as (a), which is a natural representation of Equation (1). When we expand this formulation to Equation (6), we obtain an unraveled view of a 3-block residual network (b). Circular nodes represent additions. From this view, it is apparent that residual networks have O(2^n) implicit paths connecting input and output and that adding a block doubles the number of paths.\n\nThis follows the same structure as Equation (1). Highway networks also contain residual modules and skip connections that bypass them. However, the output of each path is attenuated by a gating function t, which has learned parameters and is dependent on its input. Highway networks are equivalent to residual networks when ti(·) = 0.5, in which case data flows equally through both paths. Given an omnipotent solver, highway networks could learn whether each residual module should affect the data. This introduces more parameters and more complexity.\nInvestigating neural networks Several investigative studies seek to better understand convolutional neural networks. For example, Zeiler and Fergus [23] visualize convolutional filters to unveil the concepts learned by individual neurons. Further, Szegedy et al. 
[21] investigate the function learned\nby neural networks and how small changes in the input called adversarial examples can lead to large\nchanges in the output. Within this stream of research, the closest study to our work is from Yosinski\net al. [22], which performs lesion studies on AlexNet. They discover that early layers exhibit little\nco-adaptation and later layers have more co-adaptation. These papers, along with ours, have the\ncommon thread of exploring speci\ufb01c aspects of neural network performance. In our study, we focus\nour investigation on structural properties of neural networks.\nEnsembling Since the early days of neural networks, researchers have used simple ensembling\ntechniques to improve performance. Though boosting has been used in the past [16], one simple\napproach is to arrange a committee [3] of neural networks in a simple voting scheme, where the\n\ufb01nal output predictions are averaged. Top performers in several competitions use this technique\nalmost as an afterthought [6, 13, 18]. Generally, one key characteristic of ensembles is their smooth\nperformance with respect to the number of members. In particular, the performance increase from\nadditional ensemble members gets smaller with increasing ensemble size. Even though they are not\nstrict ensembles, we show that residual networks behave similarly.\nDropout Hinton et al. [7] show that dropping out individual neurons during training leads to a\nnetwork that is equivalent to averaging over an ensemble of exponentially many networks. Similar\nin spirit, stochastic depth [9] trains an ensemble of networks by dropping out entire layers during\ntraining. In this work, we show that one does not need a special training strategy such as stochastic\ndepth to drop out layers. 
Entire layers can be removed from plain residual networks without impacting performance, indicating that they do not strongly depend on each other.\n\n3 The unraveled view of residual networks\n\nTo better understand residual networks, we introduce a formulation that makes it easier to reason about their recursive nature. Consider a residual network with three building blocks from input y0 to output y3. Equation (1) gives a recursive definition of residual networks. The output of each stage is based on the combination of two subterms. We can make the shared structure of the residual network apparent by unrolling the recursion into an exponential number of nested terms, expanding one layer at each substitution step:\n\ny3 = y2 + f3(y2)\n\n(4)\n\n= [y1 + f2(y1)] + f3(y1 + f2(y1))\n\n(5)\n\n= [y0 + f1(y0) + f2(y0 + f1(y0))] + f3(y0 + f1(y0) + f2(y0 + f1(y0)))\n\n(6)\n\n(a) Deleting f2 from unraveled view\n\n(b) Ordinary feedforward network\n\nFigure 2: Deleting a layer in residual networks at test time (a) is equivalent to zeroing half of the paths. In ordinary feed-forward networks (b) such as VGG or AlexNet, deleting individual layers alters the only viable path from input to output.\n\nWe illustrate this expression tree graphically in Figure 1 (b). With subscripts in the function modules indicating weight sharing, this graph is equivalent to the original formulation of residual networks. The graph makes clear that data flows along many paths from input to output. Each path is a unique configuration of which residual module to enter and which to skip. Conceivably, each unique path through the network can be indexed by a binary code b ∈ {0, 1}^n where bi = 1 iff the input flows through residual module fi and 0 if fi is skipped. 
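For linear residual modules, the unraveling in Equations (4)–(6) can be verified numerically: the recursive forward pass equals a sum over all 2^n binary path codes, each path multiplying the matrices of the modules it enters. This is a sketch under a linearity assumption the paper does not make; with nonlinear modules the paths do not separate into an exact sum.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, d = 3, 4
F = [rng.standard_normal((d, d)) * 0.3 for _ in range(n)]  # linear modules f_i(x) = F_i x
y0 = rng.standard_normal(d)

# Recursive view, Equation (1): y_i = f_i(y_{i-1}) + y_{i-1}
y = y0
for Fi in F:
    y = Fi @ y + y

# Unraveled view, Equation (6) expanded: one term per binary code b in {0,1}^n.
# b_i = 1 means the path enters module f_i; b_i = 0 means it takes the skip.
unraveled = np.zeros(d)
for b in itertools.product([0, 1], repeat=n):
    term = y0
    for bi, Fi in zip(b, F):
        if bi:
            term = Fi @ term
    unraveled += term

assert np.allclose(y, unraveled)  # 2^3 = 8 paths reproduce the recursive output
```

Adding a fourth module would double the number of terms in the sum, matching the path-doubling observation in the Figure 1 caption.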
It follows that residual networks have 2^n paths connecting input to output layers.\nIn the classical visual hierarchy, each layer of processing depends only on the output of the previous layer. Residual networks cannot strictly follow this pattern because of their inherent structure. Each module fi(·) in the residual network is fed data from a mixture of 2^(i−1) different distributions generated from every possible configuration of the previous i − 1 residual modules.\nCompare this to a strictly sequential network such as VGG or AlexNet, depicted conceptually in Figure 2 (b). In these networks, input always flows from the first layer straight through to the last in a single path. Written out, the output of a three-layer feed-forward network is\n\ny3FF = f3FF(f2FF(f1FF(y0)))\n\n(7)\n\nwhere fiFF(x) is typically a convolution followed by batch normalization and ReLU. In these networks, each fiFF is only fed data from a single path configuration, the output of fi−1FF(·).\nIt is worthwhile to note that ordinary feed-forward neural networks can also be “unraveled” using the above thought process at the level of individual neurons rather than layers. This renders the network as a collection of different paths, where each path is a unique configuration of neurons from each layer connecting input to output. Thus, all paths through ordinary neural networks are of the same length. However, paths in residual networks have varying length. Further, each path in a residual network goes through a different subset of layers.\nBased on these observations, we formulate the following questions and address them in our experiments below. Are the paths in residual networks dependent on each other or do they exhibit a degree of redundancy? 
If the paths do not strongly depend on each other, do they behave like an ensemble? Do paths of varying lengths impact the network differently?\n\n4 Lesion study\n\nIn this section, we use three lesion studies to show that paths in residual networks do not strongly depend on each other and that they behave like an ensemble. All experiments are performed at test time on CIFAR-10 [12]. Experiments on ImageNet [2] show comparable results. We train residual networks with the standard training strategy, dataset augmentation, and learning rate policy [6]. For our CIFAR-10 experiments, we train a 110-layer (54-module) residual network with modules of the “pre-activation” type which contain batch normalization as the first step. For ImageNet we use 200 layers (66 modules). It is important to note that we did not use any special training strategy to adapt the network. In particular, we did not use any perturbations such as stochastic depth during training.\n\nFigure 3: Deleting individual layers from VGG and a residual network on CIFAR-10. VGG performance drops to random chance when any one of its layers is deleted, but deleting individual modules from residual networks has a minimal impact on performance. Removing downsampling modules has a slightly higher impact.\n\nFigure 4: Results when dropping individual blocks from residual networks trained on ImageNet are similar to CIFAR results. However, downsampling layers tend to have more impact on ImageNet.\n\n4.1 Experiment: Deleting individual layers from neural networks at test time\n\nAs a motivating experiment, we will show that not all transformations within a residual network are necessary by deleting individual modules from the neural network after it has been fully trained. To do so, we remove the residual module from a single building block, leaving the skip connection (or downsampling projection, if any) untouched. 
That is, we change yi = yi−1 + fi(yi−1) to y′i = yi−1. We can measure the importance of each building block by varying which residual module we remove. To compare to conventional convolutional neural networks, we train a VGG network with 15 layers, setting the number of channels to 128 for all layers to allow the removal of any layer.\nIt is unclear whether any neural network can withstand such a drastic change to the model structure. We expect them to break because dropping any layer drastically changes the input distribution of all subsequent layers.\nThe results are shown in Figure 3. As expected, deleting any layer in VGG reduces performance to chance levels. Surprisingly, this is not the case for residual networks. Removing downsampling blocks does have a modest impact on performance (peaks in Figure 3 correspond to downsampling building blocks), but no other block removal leads to a noticeable change. This result shows that, to some extent, the structure of a residual network can be changed at runtime without affecting performance. Experiments on ImageNet show comparable results, as seen in Figure 4.\nWhy are residual networks resilient to dropping layers but VGG is not? Expressing residual networks in the unraveled view provides a first insight. It shows that residual networks can be seen as a collection of many paths. As illustrated in Figure 2 (a), when a layer is removed, the number of paths is reduced from 2^n to 2^(n−1), leaving half the number of paths valid. VGG only contains a single usable path from input to output. Thus, when a single layer is removed, the only viable path is corrupted. 
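The lesioning operation itself is a one-line change to the forward pass. The toy below uses linear modules (our illustrative simplification; the paper's modules are convolutional) and shows that dropping a block keeps the computation well defined because the skip connection remains.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 4
# Toy residual network with linear modules f_i(x) = F_i x
F = [rng.standard_normal((d, d)) * 0.2 for _ in range(n)]

def forward(y, dropped=frozenset()):
    for i, Fi in enumerate(F):
        if i in dropped:
            continue          # lesion: y'_i = y_{i-1}, skip connection only
        y = Fi @ y + y        # intact block: y_i = f_i(y_{i-1}) + y_{i-1}
    return y

y0 = rng.standard_normal(d)
full = forward(y0)
lesioned = forward(y0, dropped={2})
# Dropping one block invalidates every path with b_2 = 1,
# i.e. half of the 2^6 paths; the output shifts but stays well defined.
print(np.linalg.norm(full - lesioned))
```

In the unraveled view, the surviving output is exactly the sum over the remaining paths.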
This result suggests that paths in a residual network do not strongly depend on each other although they are trained jointly.\n\n4.2 Experiment: Deleting many modules from residual networks at test-time\n\nHaving shown that paths do not strongly depend on each other, we investigate whether the collection of paths shows ensemble-like behavior. One key characteristic of ensembles is that their performance depends smoothly on the number of members. If the collection of paths were to behave like an ensemble, we would expect test-time performance of residual networks to smoothly correlate with the number of valid paths. This is indeed what we observe: deleting increasing numbers of residual modules increases error smoothly (Figure 5 (a)). This implies residual networks behave like ensembles.\n\nFigure 5: (a) Error increases smoothly when randomly deleting several modules from a residual network. (b) Error also increases smoothly when re-ordering a residual network by shuffling building blocks. The degree of reordering is measured by the Kendall Tau correlation coefficient. These results are similar to what one would expect from ensembles.\n\nWhen deleting k residual modules from a network originally of length n, the number of valid paths decreases to O(2^(n−k)). For example, the original network started with 54 building blocks, so deleting 10 blocks leaves 2^44 paths. 
Though the collection is now a factor of roughly 10^−3 of its original size, there are still many valid paths and error remains around 0.2.\n\n4.3 Experiment: Reordering modules in residual networks at test-time\n\nOur previous experiments were only about dropping layers, which has the effect of removing paths from the network. In this experiment, we consider changing the structure of the network by re-ordering the building blocks. This has the effect of removing some paths and inserting new paths that have never been seen by the network during training. In particular, it moves high-level transformations before low-level transformations.\nTo re-order the network, we swap k randomly sampled pairs of building blocks with compatible dimensionality, ignoring modules that perform downsampling. We graph error with respect to the Kendall Tau rank correlation coefficient, which measures the amount of corruption. The results are shown in Figure 5 (b). As corruption increases, the error smoothly increases as well. This result is surprising because it suggests that residual networks can be reconfigured to some extent at runtime.\n\n5 The importance of short paths in residual networks\n\nNow that we have seen that there are many paths through residual networks and that they do not necessarily depend on each other, we investigate their characteristics.\nDistribution of path lengths Not all paths through residual networks are of the same length. For example, there is precisely one path that goes through all modules and n paths that go only through a single module. From this reasoning, the distribution of all possible path lengths through a residual network follows a Binomial distribution. Thus, we know that the path lengths are closely centered around the mean of n/2. 
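The Binomial claim is easy to verify by counting. The sketch below takes only the module count 54 from the text; the code is ours.

```python
from math import comb

n = 54                      # residual modules in the CIFAR-10 network
# A path enters some subset of the n modules, so the number of paths of
# length k is C(n, k): a Binomial(n, 1/2) shape over lengths 0..n.
dist = [comb(n, k) / 2 ** n for k in range(n + 1)]

mean = sum(k * p for k, p in enumerate(dist))
central_mass = sum(dist[19:36])   # paths through 19 to 35 modules
print(mean)                       # ≈ 27, i.e. n/2
print(round(central_mass, 3))     # well above 0.95
```

The central interval 19–35 captures the overwhelming majority of the 2^54 paths, even though it spans barely a third of the network's depth.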
Figure 6 (a) shows the path length distribution for a residual network with 54 modules; more than 95% of paths go through 19 to 35 modules.\n\nFigure 6: How much gradient do the paths of different lengths contribute in a residual network? To find out, we first show the distribution of all possible path lengths (a). This follows a Binomial distribution. Second, we record how much gradient is induced on the first layer of the network through paths of varying length (b), which appears to decay roughly exponentially with the number of modules the gradient passes through. Finally, we can multiply these two functions (c) to show how much gradient comes from all paths of a certain length. Though there are many paths of medium length, paths longer than ∼20 modules are generally too long to contribute noticeable gradient during training. This suggests that the effective paths in residual networks are relatively shallow.\n\nVanishing gradients in residual networks Generally, data flows along all paths in residual networks. However, not all paths carry the same amount of gradient. In particular, the length of the paths through the network affects the gradient magnitude during backpropagation [1, 8]. To empirically investigate the effect of vanishing gradients on residual networks we perform the following experiment. Starting from a trained network with 54 blocks, we sample individual paths of a certain length and measure the norm of the gradient that arrives at the input. To sample a path of length k, we first feed a batch forward through the whole network. During the backward pass, we randomly sample k residual blocks. 
For those k blocks, we only propagate through the residual module; for the remaining n \u2212 k\nblocks, we only propagate through the skip connection. Thus, we only measure gradients that \ufb02ow\nthrough the single path of length k. We sample 1,000 measurements for each length k using random\nbatches from the training set. The results show that the gradient magnitude of a path decreases\nexponentially with the number of modules it went through in the backward pass, Figure 6 (b).\nThe effective paths in residual networks are relatively shallow Finally, we can use these results\nto deduce whether shorter or longer paths contribute most of the gradient during training. To \ufb01nd the\ntotal gradient magnitude contributed by paths of each length, we multiply the frequency of each path\nlength with the expected gradient magnitude. The result is shown in Figure 6 (c). Surprisingly, almost\nall of the gradient updates during training come from paths between 5 and 17 modules long. These\nare the effective paths, even though they constitute only 0.45% of all paths through this network.\nMoreover, in comparison to the total length of the network, the effective paths are relatively shallow.\nTo validate this result, we retrain a residual network from scratch that only sees the effective paths\nduring training. This ensures that no long path is ever used. If the retrained model is able to perform\ncompetitively compared to training the full network, we know that long paths in residual networks\nare not needed during training. We achieve this by only training a subset of the modules during each\nmini batch. In particular, we choose the number of modules such that the distribution of paths during\ntraining aligns with the distribution of the effective paths in the whole network. For the network\nwith 54 modules, this means we sample exactly 23 modules during each training batch. 
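The choice of 23 modules can be checked directly: restricting each mini-batch to 23 active modules makes the training path lengths Binomial(23, 1/2). Only the counts 23 and the 5–17 effective range are from the text; the sketch is ours.

```python
from math import comb

m = 23   # modules sampled per mini-batch, out of 54
# Path lengths within the m active modules follow a Binomial(m, 1/2) shape.
dist = [comb(m, k) / 2 ** m for k in range(m + 1)]

mean = sum(k * p for k, p in enumerate(dist))
effective_mass = sum(dist[5:18])   # paths 5 to 17 modules long
print(mean)                        # ≈ 11.5
print(round(effective_mass, 3))
```

Almost all of the probability mass lands inside the 5–17 module range identified as the effective paths.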
Then, the path lengths during training are centered around 11.5 modules, well aligned with the effective paths. In our experiment, the network trained only with the effective paths achieves a 5.96% error rate, whereas the full model achieves a 6.10% error rate. There is no statistically significant difference. This demonstrates that indeed only the effective paths are needed.\n\n6 Discussion\n\nRemoving residual modules mostly removes long paths Deleting a module from a residual network mainly removes the long paths through the network. In particular, when deleting d residual modules from a network of length n, the fraction of paths remaining per path length x is given by\n\nfraction of remaining paths of length x = C(n−d, x) / C(n, x),\n\n(8)\n\nwhere C(·, ·) denotes the binomial coefficient. Figure 7 illustrates the fraction of remaining paths after deleting 1, 10 and 20 modules from a 54-module network. It becomes apparent that the deletion of residual modules mostly affects the long paths. Even after deleting 10 residual modules, many of the effective paths between 5 and 17 modules long are still valid. Since mainly the effective paths are important for performance, this result is in line with the experiment shown in Figure 5 (a). 
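Equation (8) follows because a surviving path of length x must draw all x of its modules from the n − d remaining ones. A quick numerical sketch (module counts from the text, code ours):

```python
from math import comb

def remaining_fraction(n, d, x):
    # Equation (8): fraction of length-x paths that survive deleting d
    # of the n residual modules: C(n-d, x) / C(n, x).
    return comb(n - d, x) / comb(n, x)

n = 54
for d in (1, 10, 20):
    # Survival of the shortest and longest effective paths (5 and 17 modules)
    lo, hi = remaining_fraction(n, d, 5), remaining_fraction(n, d, 17)
    print(d, round(lo, 3), round(hi, 3))
```

For each deletion count, short paths survive at a much higher rate than long ones, which is the asymmetry Figure 7 depicts.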
Performance only drops slightly up to the removal of 10 residual modules; however, for the removal of 20 modules, we observe a severe drop in performance.\n\nFigure 7: Fraction of paths remaining after deleting individual layers. Deleting layers mostly affects long paths through the networks.\n\nFigure 8: Impact of stochastic depth on resilience to layer deletion. Training with stochastic depth only improves resilience slightly, indicating that plain residual networks already don’t depend on individual layers. Compare to Fig. 3.\n\nConnection to highway networks In highway networks, ti(·) multiplexes data flow through the residual and skip connections and ti(·) = 0.5 means both paths are used equally. For highway networks in the wild, [19] observe empirically that the gates commonly deviate from ti(·) = 0.5. In particular, they tend to be biased toward sending data through the skip connection; in other words, the network learns to use short paths. Similar to our results, this reinforces the importance of short paths.\nEffect of stochastic depth training procedure Recently, an alternative training procedure for residual networks has been proposed, referred to as stochastic depth [9]. In that approach a random subset of the residual modules is selected for each mini-batch during training. The forward and backward pass is only performed on those modules. Stochastic depth does not affect the number of paths in the network because all paths are available at test time. However, it changes the distribution of paths seen during training. 
In particular, mainly short paths are seen. Further, by selecting a different subset of\nshort paths in each mini-batch, it encourages the paths to produce good results independently.\nDoes this training procedure signi\ufb01cantly reduce the dependence between paths? We repeat the\nexperiment of deleting individual modules for a residual network trained using stochastic depth. The\nresult is shown in Figure 8. Training with stochastic depth improves resilience slightly; only the\ndependence on the downsampling layers seems to be reduced. By now, this is not surprising: we\nknow that plain residual networks already don\u2019t depend on individual layers.\n\n7 Conclusion\n\nWhat is the reason behind residual networks\u2019 increased performance? In the most recent iteration of\nresidual networks, He et al. [6] provide one hypothesis: \u201cWe obtain these results via a simple but\nessential concept\u2014going deeper.\u201d While it is true that they are deeper than previous approaches, we\npresent a complementary explanation. First, our unraveled view reveals that residual networks can be\nviewed as a collection of many paths, instead of a single ultra deep network. Second, we perform\nlesion studies to show that, although these paths are trained jointly, they do not strongly depend\non each other. Moreover, they exhibit ensemble-like behavior in the sense that their performance\nsmoothly correlates with the number of valid paths. Finally, we show that the paths through the\nnetwork that contribute gradient during training are shorter than expected. In fact, deep paths are\nnot required during training as they do not contribute any gradient. Thus, residual networks do not\nresolve the vanishing gradient problem by preserving gradient \ufb02ow throughout the entire depth of\nthe network. This insight reveals that depth is still an open research question. 
These promising observations provide a new lens through which to examine neural networks.

Acknowledgements

We would like to thank Sam Kwak and Theofanis Karaletsos for insightful feedback. We also thank the reviewers of NIPS 2016 for their very constructive and helpful feedback and for suggesting the paper title. This work is partly funded by AOL through the Connected Experiences Laboratory (Author 1), an NSF Graduate Research Fellowship award (NSF DGE-1144153, Author 2), and a Google Focused Research award (Author 3).

References

[1] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

[2] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition, 2009.

[3] Harris Drucker, Corinna Cortes, Lawrence D. Jackel, Yann LeCun, and Vladimir Vapnik. Boosting and other ensemble methods. Neural Computation, 6(6):1289–1301, 1994.

[4] Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980.

[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks.
arXiv preprint arXiv:1603.05027, 2016.

[7] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[8] Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Master's thesis, Institut für Informatik, Technische Universität München, 1991.

[9] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382, 2016.

[10] David H. Hubel and Torsten N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160(1):106–154, 1962.

[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.

[12] Alex Krizhevsky. Learning multiple layers of features from tiny images, 2009.

[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

[14] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[15] Jitendra Malik and Pietro Perona. Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America, 1990.

[16] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.

[17] Thomas Serre, Aude Oliva, and Tomaso Poggio. A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Sciences, 104(15):6424–6429, 2007.

[18] Karen Simonyan and Andrew Zisserman.
Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[19] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

[20] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

[21] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[22] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, 2014.

[23] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pages 818–833. Springer, 2014.