{"title": "LCA: Loss Change Allocation for Neural Network Training", "book": "Advances in Neural Information Processing Systems", "page_first": 3619, "page_last": 3629, "abstract": "Neural networks enjoy widespread use, but many aspects of their training, representation, and operation are poorly understood. In particular, our view into the training process is limited, with a single scalar loss being the most common viewport into this high-dimensional, dynamic process. We propose a new window into training called Loss Change Allocation (LCA), in which credit for changes to the network loss is conservatively partitioned to the parameters. This measurement is accomplished by decomposing the components of an approximate path integral along the training trajectory using a Runge-Kutta integrator. This rich view shows which parameters are responsible for decreasing or increasing the loss during training, or which parameters \"help\" or \"hurt\" the network's learning, respectively. LCA may be summed over training iterations and/or over neurons, channels, or layers for increasingly coarse views. This new measurement device produces several insights into training. (1) We find that barely over 50% of parameters help during any given iteration. (2) Some entire layers hurt overall, moving on average against the training gradient, a phenomenon we hypothesize may be due to phase lag in an oscillatory training process. (3) Finally, increments in learning proceed in a synchronized manner across layers, often peaking on identical iterations.", "full_text": "LCA: Loss Change Allocation for\n\nNeural Network Training\n\nJanice Lan\n\nUber AI\n\njanlan@uber.com\n\nRosanne Liu\n\nUber AI\n\nrosanne@uber.com\n\nHattie Zhou\n\nUber\n\nhattie@uber.com\n\nyosinski@uber.com\n\nJason Yosinski\n\nUber AI\n\nAbstract\n\nNeural networks enjoy widespread use, but many aspects of their training, represen-\ntation, and operation are poorly understood. 
In particular, our view into the training process is limited, with a single scalar loss being the most common viewport into this high-dimensional, dynamic process. We propose a new window into training called Loss Change Allocation (LCA), in which credit for changes to the network loss is conservatively partitioned to the parameters. This measurement is accomplished by decomposing the components of an approximate path integral along the training trajectory using a Runge-Kutta integrator. This rich view shows which parameters are responsible for decreasing or increasing the loss during training, or which parameters "help" or "hurt" the network's learning, respectively. LCA may be summed over training iterations and/or over neurons, channels, or layers for increasingly coarse views. This new measurement device produces several insights into training. (1) We find that barely over 50% of parameters help during any given iteration. (2) Some entire layers hurt overall, moving on average against the training gradient, a phenomenon we hypothesize may be due to phase lag in an oscillatory training process. (3) Finally, increments in learning proceed in a synchronized manner across layers, often peaking on identical iterations.

1 Introduction

In the common stochastic gradient descent (SGD) training setup, a parameterized model is iteratively updated using gradients computed from mini-batches of data chosen from some training set. Unfortunately, our view into the high-dimensional, dynamic training process is often limited to watching a scalar loss quantity decrease over time. There has been much research attempting to understand neural network training, with some work studying geometric properties of the objective function [7, 20, 28, 24, 21], properties of whole networks and individual layers at convergence [4, 7, 15, 35], and neural network training from an optimization perspective [30, 4, 5, 3, 19].
This body of work in aggregate provides rich insight into the loss landscape arising from typical combinations of neural network architectures and datasets. Literature on the dynamics of the training process itself is sparser, but a few salient works examine the learning phase through the diagonal of the Hessian, mutual information between input and output, and other measures [1, 25, 14].
In this paper we propose a simple approach to inspecting training in progress by decomposing changes in the overall network loss into a per-parameter Loss Change Allocation or LCA. The procedure for computing LCA is straightforward, but to our knowledge it has not previously been employed for investigating network training. We begin by defining this measure in more detail, and then apply it to reveal several interesting properties of neural network training. Our contributions are as follows:

1. We define the Loss Change Allocation as a per-parameter, per-iteration decomposition of changes to the overall network loss (Section 2). Exploring network training with this measurement tool uncovers the following insights.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: (a) Illustration of this paper's method on a toy two-dimensional loss surface. We allocate credit for changes to the model's training loss to individual parameters (b) θ dim-1 and (c) θ dim-2 by multiplying parameter motion with the corresponding individual component of the gradient of the training set loss. This partitions changes to the loss into individual Loss Change Allocation (LCA) components, which allows us to measure which parameters learn at each timestep, providing a rich view into the training process. In the example depicted, although both parameters move, the second parameter captures all the credit, as only its component of the gradient is non-zero.

2.
Learning is very noisy, with only slightly over half of parameters helping to reduce loss on any given iteration (Section 3).

3. Some entire layers consistently drift in the wrong direction during training, on average moving against the gradient. We propose and test an explanation that these layers are slightly out of phase, lagging behind other layers during training (Section 4).

4. We contribute new evidence to suggest that learning progress is, on a microscopic level, synchronized across layers, with small peaks of learning often occurring at the same iteration for all layers (Section 5).

2 The Loss Change Allocation approach

We begin by defining the Loss Change Allocation approach in more detail. Consider a parameterized training scenario where a model starts at parameter value θ0 and ends at parameter value θT after training. The training process entails traversing some path P along the surface of a loss landscape from θ0 to θT. There are several loss landscapes one might consider; in this paper we analyze the training process, so we measure motion along the loss with respect to the entire training set, here denoted simply L(θ). We analyze the loss landscape of the training set instead of the validation set because we aim to measure training, not training confounded with issues of memorization vs. generalization (though the latter certainly should be the topic of future studies).
The approach in this paper derives from a straightforward application of the fundamental theorem of calculus to a path integral along the loss landscape:

L(\theta_T) - L(\theta_0) = \int_C \langle \nabla_\theta L(\theta), d\theta \rangle    (1)

where C is any path from θ0 to θT and ⟨·,·⟩ is the dot product. This equation states that the change in loss from θ0 to θT may be calculated by integrating the dot product of the loss gradient and parameter motion along a path from θ0 to θT. Because ∇θL(θ) is the gradient of a function and thus is a conservative field, any path from θ0 to θT may be used; in this paper we consider the path taken by the optimizer during the course of training. We may approximate this path integral from θ0 to θT by using a series of first order Taylor approximations along the training path. If we index training steps by t ∈ [0, 1, ..., T], the first order approximation for the change in loss during one step of training is the following, rewritten as a sum of its individual components:

L(\theta_{t+1}) - L(\theta_t) \approx \langle \nabla_\theta L(\theta_t), \theta_{t+1} - \theta_t \rangle    (2)
= \sum_{i=0}^{K-1} (\nabla_\theta L(\theta_t))^{(i)} (\theta_{t+1}^{(i)} - \theta_t^{(i)}) := \sum_{i=0}^{K-1} A_{t,i}    (3)

where ∇θL(θt) represents the gradient of the loss of the whole training set w.r.t. θ evaluated at θt, v^(i) represents the i-th element of a vector v, and the parameter vector θ contains K elements. Note that while we evaluate model learning by tracking progress along the training set loss landscape L(θ), training itself is accomplished using stochastic gradient approaches in which noisy gradients from mini-batches of data drive parameter updates via some optimizer like SGD or Adam. As shown in Equation 3, the difference in loss produced by one training iteration t may be decomposed into K individual Loss Change Allocation, or LCA, components, denoted A_{t,i}.
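In code, the per-step decomposition of Equations 2 and 3 is just an elementwise product of the full-training-set gradient with the parameter update. A minimal plain-Python sketch (our illustration only; names like `lca_step` are not from the released code):

```python
def lca_step(grad, theta_before, theta_after):
    """First-order LCA for one training step (Equation 3):
    A_{t,i} = (dL/dtheta_i at theta_t) * (theta_{t+1,i} - theta_{t,i}).

    `grad` must be the gradient of the loss over the *whole training set*
    at theta_t, even though the optimizer stepped on a mini-batch.
    The returned values sum to the first-order estimate of
    L(theta_{t+1}) - L(theta_t); negative entries "help", positive "hurt".
    """
    return [g * (after - before)
            for g, before, after in zip(grad, theta_before, theta_after)]

# Toy example in the spirit of Figure 1: both parameters move, but only
# the second has a non-zero gradient component, so it gets all the credit.
print(lca_step([0.0, -2.0], [1.0, 1.0], [1.5, 1.5]))  # [0.0, -1.0]
```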
These K components\nrepresent the LCA for a single iteration of training, and over the course of T iterations of training we\nwill collect a large T \u00d7 K matrix of At,i values.\nThe total loss over the course of training will often decrease, and the above decomposition allows\nus to allocate credit for loss decreases on a per-parameter, per-timestep level. Intuitively, when the\noptimizer increases the value of a parameter and its component of the gradient on the whole training\nset is negative, the parameter has a negative LCA and is \u201chelping\u201d or \u201clearning\u201d. Positive LCA is\n\u201churting\u201d the learning process, which may result from several causes: a noisy mini-batch with the\ngradient of that step going the wrong way, momentum, or a step size that is too large for a curvy or\nrugged loss landscape as seen in [14, 32]. If the parameter has a non-zero gradient but does not move,\nit does not affect the loss. Similarly, if a parameter moves but has zero gradient, it does not affect the\nloss. The sum of the K components is the overall change in loss at that iteration. Figure 1 depicts a\ntoy example using two parameters. Throughout the paper we use \u201chelping\u201d to indicate negative LCA\n(a contribution to the reduction of total loss), and \u201churting\u201d for positive LCA.\nAn important property of this decomposition is that it is grounded: the sum of individual components\nequals the total change in loss, and each contribution has the same fundamental units as the loss\noverall (e.g. nats or bits in the case of cross-entropy). This is in contrast to approaches that measure\nquantities like parameter motion or approximate elements of the Fisher information (FI) [16, 1],\nwhich also produce per-parameter measurements but depend heavily on the parameterization chosen.\nFor example, the FI metric is sensitive to scale (e.g. 
multiplying one ReLU layer's weights by 2 and the next layer's by 0.5 leaves the loss unchanged, but the FI of each layer and the total FI both change). Further, LCA has the benefit of being signed, allowing us to make measurements and interpretations when training goes backwards (Sections 3 and 4).
Ideally, summing up the K components should equal L(θt+1) − L(θt). In practice, the first order Taylor approximation is often inaccurate due to the curvature of the loss landscape. We can improve on our LCA approximation from Equation 2 by replacing ∇θL(θt) with (1/6)(∇θL(θt) + 4∇θL((θt + θt+1)/2) + ∇θL(θt+1)), with the (1, 4, 1) coefficients coming from the fourth-order Runge–Kutta method (RK4) [23, 17] or equivalently from Simpson's rule [31]. Using a midpoint gradient doubles computation but shrinks accumulated error drastically, from first order to fourth order. If the error is still too large, we can halve the step size with composite Simpson's rule by calculating gradients at (3θt + θt+1)/4 and (θt + 3θt+1)/4 as well. We halve the step size until the absolute error of change in loss per iteration is less than 0.001, and we ensure that the cumulative error at the end of training is less than 1%. First order and RK4 errors can be found in Table S1 in Supplementary Information.
Note that the approach described may be applied to any parameterized model trained via gradient descent, but for the remainder of the paper we assume the case of neural network training.

2.1 Experiments

We employ the LCA approach to examine training on two tasks: MNIST and CIFAR-10, with architectures including a 3-layer fully connected (FC) network and LeNet [18] on MNIST, and AllCNN [29] and ResNet-20 [9] on CIFAR-10.
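The Simpson's-rule refinement described in Section 2 can be sketched as follows, assuming a `grad_fn` that returns the full-training-set gradient at an arbitrary parameter vector (the interface and names are ours, not the released code's):

```python
def simpson_lca(grad_fn, theta0, theta1):
    """One-step LCA using the (1, 4, 1)/6 weights of Simpson's rule / RK4.

    grad_fn(theta) -> gradient of the full-training-set loss at theta
    (a hypothetical callable; any framework's autograd can supply it).
    Mixing endpoint and midpoint gradients cuts the per-step integration
    error from first order to fourth order.
    """
    mid = [0.5 * (a + b) for a, b in zip(theta0, theta1)]
    g0, gm, g1 = grad_fn(theta0), grad_fn(mid), grad_fn(theta1)
    return [(a + 4.0 * m + b) / 6.0 * (t1 - t0)
            for a, m, b, t0, t1 in zip(g0, gm, g1, theta0, theta1)]
```

If the per-step error is still too large, composite Simpson's rule amounts to applying this function on the half-steps from θt to the midpoint and from the midpoint to θt+1, then summing the two results.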
Throughout this paper we refer to training runs as "dataset–network", e.g., MNIST–FC, MNIST–LeNet, CIFAR–AllCNN, CIFAR–ResNet, followed by further configuration details (such as the optimizer) when needed.
For each dataset–network configuration, we train with both SGD and Adam optimizers, and conduct multiple runs with identical hyperparameter settings. Momentum of 0.9 is used for all SGD runs, except for one set of "no-momentum" MNIST–FC experiments. Learning rates are manually chosen between 0.001 and 0.5. See Section S7 in Supplementary Information for more details on architectures and hyperparameters. We also make our code available at https://github.com/uber-research/loss-change-allocation. Note that we use standard network architectures to demonstrate use cases of our tool; we strive for simplicity and interpretability of results rather than state-of-the-art performance. Thus we do not incorporate techniques such as L2 regularization, data augmentation, and learning rate decay. Since our method requires calculating gradients of the loss over the entire training set, it is considerably slower than the regular training process, but remains tractable for small to medium models; see Section S8 for more details on computation.

Figure 2: Frames from an animation of the learning process for two training runs. (left) The 1st layer of an MNIST–FC (full shape is 100×784, but only the upper left quarter is shown for better clarity). (right) The 2nd convolutional layer of an MNIST–LeNet (full shape is 40×20 of 5×5 blocks; only the upper left quarter is shown). Each pixel represents one parameter. The LeNet layer shows 5×5 grids representing each filter, laid out by input channels (columns) and output channels (rows). Parameters that help (decrease the loss) at a given time are shown as shades of green.
Parameters that hurt (increase the loss) are shown as shades of red. Larger magnitudes of LCA are darker and white indicates zero LCA. Iteration 20 is partly through the main drop in loss, and 220 is one full epoch. In MNIST–FC, we can see clusters spaced at intervals of 28 pixels, because these parameters connect to the flattened MNIST images. Learning is strongest in early iterations with mostly negative LCA, remains strong for many iterations but with more variance in LCA across parameters, and has greatly diminished by iteration 220, where much of learning is complete. The complete animations may be viewed at: https://youtu.be/xcnoRnoVyXQ and https://youtu.be/EY3LoXmdkYU.

2.2 Direct visualization

We calculate LCA for every parameter at every iteration and animate the LCA values across the whole training process. Figure 2 shows snapshots of frames from the video visualization. In such videos, we arrange parameters first by layer and then for each layer as two-dimensional matrices (1-D vectors for biases), and overlay LCA values as a heatmap. This animation enables a granular view of the training process.
We can also directly visualize each parameter versus time, granting each parameter its own training curve. We can optionally aggregate over neurons, channels, layers, etc. (see Section S2 for examples). A benefit of these visualizations is that they convey a large volume of data directly to the viewer, surfacing subtle patterns and bugs that can be further investigated. Observed patterns also suggest more quantitative metrics that surface traits of training. The rest of the paper is dedicated to such metrics and traits.

3 Learning is very noisy

Although the inherent noise in SGD-based neural network training is widely acknowledged and even considered beneficial [15], this noise is often loosely defined as a deviation in gradient estimation.
While the minibatch gradient serves as a suggested direction for parameter movement, it is still one step away from the actual impact on decreasing loss over the whole training set, which LCA represents precisely. By aggregating a population of per-parameter, per-iteration LCAs along different axes, we present numerical results that shed light on this noisy learning behavior. We find it surprising that on average almost half of parameters are hurting in every training iteration. Moreover, each parameter, including ones that help in total, hurts almost half of the time.

Table 1: Percentage of helping parameters (ignoring those with zero LCA) for various networks and optimizers, averaged across all iterations and 3 independent runs per configuration.

       MNIST-FC, mom=0   MNIST-FC       MNIST-LeNet    CIFAR-ResNet    CIFAR-AllCNN
SGD    53.72 ± 0.05      53.97 ± 0.48   57.79 ± 0.16   50.66 ± 0.14    51.09 ± 0.23
Adam   N/A               51.77 ± 0.21   55.82 ± 0.09   50.30 ± 0.004   50.19 ± 0.01

Figure 3: (a) Visualization of the percentage of parameters that helped, hurt, or had zero effect through training, overlaid with the loss curve of that run. (b) The distribution of helping and hurting LCA (zeros ignored) over the entire training, zoomed in to ignore 1% of tails. (c) Average percent of weights helping for each layer in the network, curiously near 50% for all. (d) Histogram of the fraction of iterations each weight helped, showing that most weights swing back and forth between helping and hurting evenly. In every column the first row is MNIST–FC and the second row CIFAR–ResNet, both trained with SGD.
Notable facts: MNIST–FC shows a significant percent of weights with zero effect. Because MNIST has pixels that are never on, any first layer weights connected to those pixels cannot help or hurt. CIFAR–ResNet exhibits barely over 50% of parameters helping over the course of training, even during the period of significant learning (loss reduction) from iteration 0 to 2000. Averaged over the entire run, only 50.66% of parameters helped (see Table 1). Note that in both runs we can see that in the earliest iterations, the percent of weights helping is higher, but only slightly.

Barely over 50% of parameters help during training. According to our definition, for each iteration of training, parameters can help, hurt, or not impact the overall loss. With that in mind, we count the number of parameters that help, hurt, or neither, across all training iterations and for various networks; two example networks are shown in Figure 3 (all other networks shown in Section S3). The data show that in a typical training iteration, close to half of parameters are helping and close to half are hurting! This ratio is slightly skewed towards helping in early iterations but stays fairly constant during training. Averaged across all iterations, the percentage of helping parameters for various network configurations is reported in Table 1. We see that it varies within a small range of 50% to 58%, with CIFAR networks even tighter at 50% to 51%. This observation also holds true when we look at each layer separately in a network; Figure 3(c) shows that all layers have similar ratios of helpful parameters.
Parameters alternate helping. Now that we can tell if each parameter is "helpful", "hurtful", or "neither"¹, we wonder if parameters predictably stay in the same category throughout training. In other words, is there a consistent elite group of parameters that always help?
When we measure the percentage of helpful iterations per parameter throughout a training run, the histograms in Figure 3(d) show that parameters help approximately half of the time, and therefore the training of a network is achieved by parameters alternating in making helpful contributions to the loss.
Additionally, we can measure the oscillations of individual parameters. Figure S7 shows a high number of oscillations in weight movement for CIFAR–ResNet on SGD: on average, weight movements change direction once every 6.7 iterations, and gradients change signs every 9.5 iterations. Section S3 includes these measures for all networks, as well as detailed views in Figure S8 suggesting that many of these oscillations happen around local minima. While oscillations have been previously observed for the overall network [32, 14], thanks to LCA, we're able to more precisely quantify the individual and net effects of these oscillations. As we'll see in Section 4, we can also use LCA to identify when a network is damaged not by oscillations themselves, but by their precise phase.
Noise persists across various hyperparameters. Changing the learning rate, momentum, or batch size (within reasonable ranges such that the network still trains well) has only a slight effect on the percent of parameters helping. See Section S3 for a set of experiments on CIFAR–ResNet with SGD, where the percent helped always stays within 50.3% to 51.6% for reasonable hyperparameters.

¹We rarely see "neither", or zero-impact, parameters in CIFAR networks, but they can be of a noticeable amount for MNIST (around 20% for MNIST–FC; see Figure 3), mostly due to the many dead pixels in MNIST.

Figure 4: (left) LCA summed over all of training, for each layer, in CIFAR–ResNet trained with SGD. Bias and batch norm layers are combined into their corresponding kernel layers. Blue represents regular runs.
Orange is with the last layer frozen at initialization. Note that the other layers, especially\nthe adjacent few, do not help as much, but the difference in LCA of the last layer is greater than\nthe total differences of the other layers helping less. Green is with the last layer at a 10x smaller\nlearning rate than the rest of the network, showing similar layer LCAs as when the layer is frozen.\n(right) Resulting train loss and standard deviations for each run con\ufb01guration. Means and standard\ndeviations are over 10 runs for each experiment con\ufb01guration.\n\nLearning is heavy-tailed. A reasonable mental model of the distribution of LCA might be a narrow\nGaussian around the mean. However, we \ufb01nd that this is far from reality. Instead, the LCA of both\nhelping and hurting parameters follow a heavy-tailed distribution, as seen in Figure 3(b). Figure S10\ngoes into more depth in this direction, showing that contributions from the tail are about three times\nlarger than would be expected if learning were Gaussian distributed. More precisely, a better model\nof LCA would be the Weibull distribution with k < 1. The measurements suggest that the view of\nlearning as a Wiener process [25] should be re\ufb01ned to re\ufb02ect the heavy tails.\n\n4 Some layers hurt overall\nAlthough our method is used to study low-level, per-parameter LCA, we can also aggregate these\nover higher level breakdowns for different insights; individually there is a lot of noise, but on the\nwhole, the network learns. 
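Given the T × K matrix of A_{t,i} values from Section 2, both the per-iteration helping fraction (Section 3) and the per-layer, all-time totals used here are simple reductions. A plain-Python sketch, with nested lists standing in for the matrix and illustrative layer slices of our own choosing:

```python
def percent_helping(lca_matrix):
    """Fraction of parameters with negative (helping) LCA per iteration,
    ignoring zero-LCA parameters, as in Table 1."""
    fracs = []
    for row in lca_matrix:                      # one row per iteration
        nonzero = [a for a in row if a != 0.0]
        helping = sum(1 for a in nonzero if a < 0.0)
        fracs.append(helping / len(nonzero) if nonzero else 0.0)
    return fracs

def layer_totals(lca_matrix, layer_slices):
    """Sum LCA over all iterations and over each layer's parameters;
    a positive total means that layer hurt overall."""
    per_param = [sum(col) for col in zip(*lca_matrix)]
    return [sum(per_param[s]) for s in layer_slices]

lca = [[-3.0, 1.0, 0.0],   # iteration 0
       [ 2.0, -4.0, 0.0]]  # iteration 1
print(percent_helping(lca))                           # [0.5, 0.5]
print(layer_totals(lca, [slice(0, 2), slice(2, 3)]))  # [-4.0, 0.0]
```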
The behavior of individual layers during learning has been of interest to many researchers [35, 22], so a simple and useful aggregation is to sum LCA over all parameters within each layer and sum over all time, measuring how much each layer contributes to total learning.
We see an expected pattern for MNIST–FC and MNIST–LeNet (all layers helping; Figure S11), but CIFAR–ResNet with SGD shows a surprising pattern: the first and last layers consistently hurt training (positive total LCA). Over ten runs, the first and last layer in ResNet hurt statistically significantly (p-values < 10⁻⁴ for both), whereas all other layers consistently help (p-values < 10⁻⁴ for all). The blue bars in Figure 4 show this distinct effect. Such a surprising observation calls for further investigation. The following experiments shed light on why this might be happening.
Freezing the first layer stops it from hurting but causes others to help less. We try various experiments freezing the first layer at its random initialization. Though we can prevent this layer from hurting, the overall performance is not any better because the other layers, especially the neighboring ones, start to help less; see Figure S13 for details. Nonetheless, this can be useful for reducing compute resources during training, as you can freeze the first layer without impairing performance.
Freezing the last layer results in significant improvement. In contrast to the first layer, freezing the last layer at its initialization (Figure 4) improves training performance (and, curiously, test performance; not shown), with p-values < 0.001 for both train loss and test loss, over 10 runs! We also observe other layers, especially neighboring ones, not helping as much, but this time the change in the last layer's LCA more than compensates.
Decreasing the learning rate of the last layer by 10x (0.01 as opposed to 0.1 for other layers) results in similar behavior as freezing it. These experiments are consistent with findings in [12] and [8], which demonstrate that you can freeze the last layer in some networks without degrading performance. With LCA, we are now able to provide an explanation for when and why this phenomenon happens. The instability of the last layer at the start of training in [8] can also be measured by LCA, as the LCA of the last layer is typically high in the first few iterations.

Figure 5: CIFAR–ResNet SGD with varying momentum for the last layer (and a fixed 0.9 for all other layers). Selected momentum values are derived from linear values of delay [0, 1, 2, ..., 9] in a control system, where momentum = delay/(delay + 1), and a delay of 9 corresponds to regular runs of 0.9 momentum. (left) LCA per layer (only the second half of the network is shown for better visibility; the first half follows a similar trend, but less pronounced). As the last layer helps more, the other layers hurt more because they are relatively more delayed. (right) LCA of the last layer is fairly linear with respect to the delay.

Phase shift hypothesis: is the last layer phase-lagged? While it is interesting to see that decreasing the learning rate by 10x or to zero changes the last layer's behavior, this on its own does not explain why the layer would end up going backwards. The mini-batch gradient is an unbiased estimator of the whole training set gradient, so on average the dot product of the mini-batch gradient with the training set gradient is positive. Thus we must look beyond noise and learning rate for explanation.
We hypothesize that the last layer may be phase lagged with respect to other layers during learning. Intuitively, it may be that while all layers are oscillating during learning, the last layer is always a bit behind. As each parameter swings back and forth across its valley, the shape of its valley is affected by the motion of all other parameters. If one parameter is frozen and all other parameters trained infinitesimally slowly, that parameter's valley will tend to flatten out. This means if it had climbed a valley (hurting the loss), it will not be able to fully recover the LCA in the negative direction, as the steep region has been flattened. If the last layer reacts slower than others, its own valley walls may tend to be flattened before it can react.
A simple test for this hypothesis is as follows. We note that training with momentum 0.9 corresponds to an information lag of 9 steps (the mean of an exponential series with exponent 0.9): each update applied uses information 9 steps old. To give the last layer an advantage, we train it with momentum corresponding to a delay of n for n ∈ {9, 8, ..., 0} while training all other layers as usual. As shown in Figure 5, this works, and the transition from hurting to helping (a lot) is almost linear with respect to delay! As we give the last layer an information freshness advantage, it begins to "steal progress" from other layers, eventually forcing the neighboring layers into positive LCA. These results suggest that it may be profitable to view training as a fundamentally oscillatory process upon which much research in phase-space representations and control system design may come to bear.
Beyond CIFAR–ResNet, other networks also show intriguingly heterogeneous layer behaviors. As we noted before, in the case of MNIST–FC and MNIST–LeNet trained with SGD, all layers help with varying quantities.
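The momentum-as-delay correspondence used in this experiment (momentum = delay/(delay + 1), per Figure 5) can be checked numerically: the mean age of the gradients mixed into an exponentially weighted momentum buffer with coefficient m is m/(1 − m). A small sketch with helper names of our own:

```python
def momentum_for_delay(delay):
    """Momentum coefficient whose buffer carries a mean information age
    of `delay` steps (delay 9 -> the usual 0.9)."""
    return delay / (delay + 1.0)

def mean_age(m, horizon=10000):
    """Mean age of gradients in a momentum buffer: the weight of the
    k-step-old gradient is proportional to m**k."""
    weights = [(1.0 - m) * m ** k for k in range(horizon)]
    return sum(k * w for k, w in enumerate(weights)) / sum(weights)

print(momentum_for_delay(9))    # 0.9
print(round(mean_age(0.9), 6))  # 9.0
```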
An MNIST\u2013ResNet (added speci\ufb01cally to see if the effect we see above is due to\nthe data or the network) shows the last layer hurting as well. We also observe the last layer hurting\nfor CIFAR\u2013AllCNN with SGD (Figure S14) and multiple layers hurting for a couple of VGG-like\nnetworks (Figure S12). When using Adam instead of SGD, CIFAR\u2013ResNet has a consistently hurting\n\ufb01rst layer and an inconsistently hurting last two layers. CIFAR\u2013AllCNN trained with Adam does\nnot have any hurting layers. We note that layers hurting is not a universal phenomenon that will be\nobserved in all networks, but when it does occur, LCA can identify it and suggest potential candidates\nto freeze. Further, viewing training through the lens of information delay seems valid, which suggests\nthat per-layer optimization adjustments may be bene\ufb01cial.\n\n5 Learning is synchronized across layers\nWe learned that layers tend to have their own distinct, consistent behaviors regarding hurting or\nhelping from per-layer LCA summed across all iterations. In this section we further examine the\nper-layer LCA during training, equivalent to studying individual \u201closs curves\u201d for each layer, and\n\n7\n\n10121416182022layer86420sum of LCA across layermomentum = 0.9momentum variesTotal LCA per layer, ResNet SGD with varied last layer momentumdelay = 9, momentum = 0.9 (regular)delay = 8, momentum = 0.889delay = 7, momentum = 0.875delay = 6, momentum = 0.857delay = 5, momentum = 0.833delay = 4, momentum = 0.8delay = 3, momentum = 0.75delay = 2, momentum = 0.667delay = 1, momentum = 0.5delay = 0, momentum = 002468last layer delay86420last layer LCALast layer: LCA vs. delayregular run\fFigure 6: Peak learning iterations by layer by class on MNIST\u2013FC. The same LCA data as in\nFigure S17 but seperated by class. We plot the top 20 iterations by LCA for each class and each layer,\nwhere that iteration represents a local minimum for LCA. 
The layers are ordered from bottom to top. Points highlighted in red represent iterations where all three layers had peak learning for that particular class. To measure the statistical significance of these vertical line structures in red, we simulate a baseline by shifting each layer in each class randomly by −2, −1, 0, 1, or 2 iterations. We find that the average number of vertical lines is 0.4 in the baseline and 9.4 in the actual network; this difference is significant with a p-value < 0.001.

discover that the exact moments where learning peaks are curiously synchronized across layers. Such synchronization is driven not by gradients or parameter motion alone, but by both.
We define "moments of learning" as temporal spikes in the instantaneous LCA curve: local minima where the loss decreased more on that iteration than on the iteration before or after. We show the top 20 such moments (highest magnitude of LCA) for each layer in Figure S17. We further decompose this metric by class (10 classes for both MNIST and CIFAR), where the same moments of learning are identified on per-class, per-layer LCA, shown in Figure 6. Whenever learning is synchronized across layers (dots that are vertically aligned), the dots are marked in red. Additional figures on CIFAR–ResNet can be seen in Section S5. The large proportion of red aligned stacks suggests that learning is very locally synchronized across layers.
To gauge statistical significance, we compare the number of synchronized moments in these networks to a simple baseline: the number we would observe if each layer had been randomly shifted one or two iterations earlier or later. We find that the number of synchronized moments is significantly higher than in this baseline (p-value < 10^-6). See details on this experiment in Section S5.
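The peak detection and the shifted baseline described above can be sketched as follows. This is a minimal illustration under our own helper names, not the authors' code: a "moment of learning" is any iteration where the instantaneous LCA is a local minimum, we keep the top 20 by magnitude, and the baseline shifts each layer's entire peak set by one random offset in {−2, ..., 2} before counting iterations on which all layers peak simultaneously:

```python
import numpy as np

def moments_of_learning(lca: np.ndarray, top_k: int = 20) -> set:
    """Iterations where LCA is a local minimum (loss dropped more than on the
    neighboring iterations), keeping the top_k most negative ones."""
    t = np.arange(1, len(lca) - 1)
    is_min = (lca[t] < lca[t - 1]) & (lca[t] < lca[t + 1])
    candidates = t[is_min]
    top = candidates[np.argsort(lca[candidates])[:top_k]]  # most negative LCA first
    return set(top.tolist())

def count_synchronized(peaks_per_layer: list) -> int:
    """Number of iterations at which every layer has a moment of learning."""
    return len(set.intersection(*peaks_per_layer))

def shifted_baseline(peaks_per_layer: list, rng, max_shift: int = 2) -> int:
    """Baseline: shift each layer's whole peak set by one random offset in
    {-max_shift, ..., max_shift}, then count alignments."""
    shifted = []
    for peaks in peaks_per_layer:
        off = int(rng.integers(-max_shift, max_shift + 1))
        shifted.append({p + off for p in peaks})
    return count_synchronized(shifted)
```

Repeating `shifted_baseline` many times and comparing the resulting distribution against the observed `count_synchronized` yields the p-value quoted above.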
Thus, we conclude that for the networks we measured, learning happens curiously synchronously across layers throughout the network. We might find different behavior in other architectures, such as transformer models or recurrent neural networks, which could be of interest for future work.
But what drives such synchronization? Since learning is defined as the product of parameter motion and parameter gradient, we further examine whether either one is synchronized on its own. Plotting identified local peaks in the same fashion, we observe that the synchronization pattern in per-layer gradients is clearly different from that in LCA, both for the total loss (Figure S18) and for the per-class losses (Figure S16). Since parameter motion (Figure S19) is the same across all classes, it alone does not drive the per-class LCA. We therefore conclude that the synchronization of learning, demonstrated by synchronized behavior in LCA (Figure 6), is strong, and arises from both parameter motion and gradient.

6 Conclusion

The Loss Change Allocation method acts as a microscope into the training process, allowing us to examine the inner workings of training at a much finer grain. Applied to various tasks, networks, and training runs, it reveals many interesting patterns in neural network training that improve our understanding of training dynamics and suggest practical model improvements.

6.1 Related work

We note additional connections to existing literature here. A common understanding is that learning in networks is sparse: a subnetwork [6] or a random subspace of parameters [19] is sufficient for optimization and generalization. Our method provides an additional, more accurate measure of usefulness to characterize per-parameter contributions.
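To make the per-parameter bookkeeping concrete, here is a minimal first-order sketch of loss change allocation. The paper's actual measurement uses a Runge-Kutta integrator along the training path for better accuracy; the toy quadratic loss and helper names here are ours. Each step's loss change is allocated to parameter i as the product of its gradient and its movement, and a parameter "helps" on a step when its allocation is negative:

```python
import numpy as np

def lca_first_order(grads: np.ndarray, thetas: np.ndarray) -> np.ndarray:
    """First-order LCA: allocation[t, i] = dL/dtheta_i(theta_t) * (theta_{t+1, i} - theta_{t, i}).
    grads: (T, P) gradients at each of T iterates; thetas: (T+1, P) trajectory.
    Summing allocation over i approximates L(theta_{t+1}) - L(theta_t)."""
    return grads * (thetas[1:] - thetas[:-1])

# Toy example: SGD on L(theta) = 0.5 * ||theta||^2, whose gradient is theta.
rng = np.random.default_rng(0)
theta = rng.normal(size=3)
lr = 0.1
thetas, grads = [theta.copy()], []
for _ in range(50):
    g = theta                   # gradient of the quadratic loss at theta
    grads.append(g.copy())
    theta = theta - lr * g      # plain SGD step
    thetas.append(theta.copy())

alloc = lca_first_order(np.array(grads), np.array(thetas))
helped = (alloc < 0).mean()     # fraction of (iteration, parameter) pairs that helped
total = alloc.sum(axis=0)       # per-parameter LCA summed over training
```

Summing `alloc` over iterations and over groups of parameters yields the per-neuron, per-channel, and per-layer views used throughout the paper; on this noiseless quadratic every parameter helps on every step, whereas SGD on real losses produces the barely-over-50% helping rate reported earlier.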
A similar work [34] defines per-parameter importance in the same vein, but it is computed locally with the mini-batch gradient, which overestimates the true per-parameter contribution to the decrease of the loss over the whole training set.
Several previous works have increased our understanding of the training process. Alain and Bengio [2] measured, and tracked over time, the ability to linearly predict the final class output from intermediate layer representations. Raghu et al. [22] found that networks converge to final representations from the bottom up, and that class-specific information is formed at various places in the network. Shwartz-Ziv and Tishby [25] visualized the training process through the information plane, identifying two phases: empirical error minimization followed by slow representation compression. These measurements are informative, but none examine the process that each individual parameter undergoes.
Methods like saliency maps [27], DeepVis [33], and others allow interpretation of representations or loss surfaces. But these works only address the end result of training, not the training process itself. LCA can be seen as a new tool that specializes in microscopic detail, and its inspection follows the whole training process to reveal interesting facts about learning. Some of our findings resonate with and complement other work. For example, [35] also observes that layers have heterogeneous characteristics; in that work layers are denoted as either "robust" or "critical", and robust layers can even be reset to their initial values with no negative consequence.

6.2 Future work

There are many potential directions in which to expand this work.
Due to the expensive computation and the amount of analysis involved, we have so far tested only vision classification tasks on relatively small datasets. In the future we would like to run this on larger datasets and on tasks beyond supervised learning, since the LCA method applies directly to any parameterized model. One avenue past the expensive computation is to analyze how well the method can be approximated using gradients of the loss on a subset of the training set. We are interested to see whether our observations hold beyond the vision tasks and the range of hyperparameters used.
Since per-weight LCA can be seen as a measurement of weight importance, a simple extension is to perform weight pruning with it, as done in [6, 36] (where a weight's final value is used as an importance measure). Further, if there are strong correlations between underperforming hyperparameters and patterns of LCA, this may help in architecture search or in identifying better hyperparameters.
We are also already able to identify which layers or parameters overfit by comparing their LCA on the training set with their LCA on the validation or test set, which points toward future work on targeted regularization. Finally, the observations about noise, oscillations, and phase delays can potentially lead to improved optimization methods.

Acknowledgements

We would like to acknowledge Joel Lehman, Richard Murray, and members of the Deep Collective research group at Uber AI for conversations, ideas, and feedback on experiments.

References

[1] Alessandro Achille, Matteo Rovere, and Stefano Soatto. Critical learning periods in deep neural networks. CoRR, abs/1711.08856, 2017. URL http://arxiv.org/abs/1711.08856.

[2] G. Alain and Y. Bengio. Understanding intermediate layers using linear classifier probes. ArXiv e-prints, October 2016.

[3] Léon Bottou, Frank E Curtis, and Jorge Nocedal.
Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

[4] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.

[5] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.

[6] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), volume abs/1803.03635, 2019. URL http://arxiv.org/abs/1803.03635.

[7] Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014.

[8] Akhilesh Gotmare, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. A closer look at deep learning heuristics: Learning rate restarts, warmup and distillation. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r14EOsCqKX.

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

[11] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors.
arXiv preprint arXiv:1207.0580, 2012.

[12] Elad Hoffer, Itay Hubara, and Daniel Soudry. Fix your classifier: the marginal value of training the last weight layer. CoRR, abs/1801.04540, 2018. URL http://arxiv.org/abs/1801.04540.

[13] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.

[14] Stanisław Jastrzębski, Zachary Kenton, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. On the relation between the sharpest directions of DNN loss and the SGD step length. In International Conference on Learning Representations (ICLR), arXiv:1807.05031, July 2019.

[15] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

[16] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017. doi: 10.1073/pnas.1611835114. URL http://www.pnas.org/content/114/13/3521.abstract.

[17] Wilhelm Kutta. Beitrag zur näherungsweisen Integration totaler Differentialgleichungen. 1901.

[18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[19] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes.
In International Conference on Learning Representations, April 2018.

[20] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pages 6389–6399, 2018.

[21] Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2603–2612. JMLR.org, 2017.

[22] M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. ArXiv e-prints, June 2017.

[23] Carl Runge. Über die numerische Auflösung von Differentialgleichungen. Mathematische Annalen, 46(2):167–178, 1895.

[24] Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pages 774–782, 2016.

[25] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. CoRR, abs/1703.00810, 2017. URL http://arxiv.org/abs/1703.00810.

[26] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. URL http://arxiv.org/abs/1409.1556.

[27] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, presented at ICLR Workshop 2014, 2013.

[28] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

[29] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin A. Riedmiller. Striving for simplicity: The all convolutional net. CoRR, abs/1412.6806, 2014.
URL http://arxiv.org/abs/1412.6806.

[30] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.

[31] Eric W Weisstein. Simpson's rule. 2003.

[32] Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with SGD. 2018.

[33] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. ArXiv e-prints, June 2015.

[34] Friedemann Zenke, Ben Poole, and Surya Ganguli. Improved multitask learning through synaptic intelligence. CoRR, abs/1703.04200, 2017. URL http://arxiv.org/abs/1703.04200.

[35] Chiyuan Zhang, Samy Bengio, and Yoram Singer. Are all layers created equal? arXiv preprint arXiv:1902.01996, 2019.

[36] Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing lottery tickets: Zeros, signs, and the supermask. arXiv preprint arXiv:1905.01067, 2019.