{"title": "SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability", "book": "Advances in Neural Information Processing Systems", "page_first": 6076, "page_last": 6085, "abstract": "We propose a new technique, Singular Vector Canonical Correlation Analysis (SVCCA), a tool for quickly comparing two representations in a way that is both invariant to affine transform (allowing comparison between different layers and networks) and fast to compute (allowing more comparisons to be calculated than with previous methods). We deploy this tool to measure the intrinsic dimensionality of layers, showing in some cases needless over-parameterization; to probe learning dynamics throughout training, finding that networks converge to final representations from the bottom up; to show where class-specific information in networks is formed; and to suggest new training regimes that simultaneously save computation and overfit less.", "full_text": "SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability\n\nMaithra Raghu,1,2 Justin Gilmer,1 Jason Yosinski,3 & Jascha Sohl-Dickstein1\n1Google Brain 2Cornell University 3Uber AI Labs\nmaithrar@gmail.com, gilmer@google.com, yosinski@uber.com, jaschasd@google.com\n\nAbstract\n\nWe propose a new technique, Singular Vector Canonical Correlation Analysis (SVCCA), a tool for quickly comparing two representations in a way that is both invariant to affine transform (allowing comparison between different layers and networks) and fast to compute (allowing more comparisons to be calculated than with previous methods). 
We deploy this tool to measure the intrinsic dimensionality of layers, showing in some cases needless over-parameterization; to probe learning dynamics throughout training, finding that networks converge to final representations from the bottom up; to show where class-specific information in networks is formed; and to suggest new training regimes that simultaneously save computation and overfit less.\n\n1 Introduction\n\nAs the empirical success of deep neural networks ([7, 9, 18]) becomes an indisputable fact, the goal of better understanding these models escalates in importance. Central to this aim is the core issue of deciphering learned representations. Facets of this key question have been explored empirically, particularly for image models, in [1, 2, 10, 12, 13, 14, 15, 19, 20]. Most of these approaches are motivated by the interpretability of learned representations. More recently, [11] studied the similarities of representations learned by multiple networks by finding permutations of neurons with maximal correlation.\n\nIn this work we introduce a new approach to the study of network representations, based on an analysis of each neuron's activation vector, the scalar outputs it emits on input datapoints. With this interpretation of neurons as vectors (and layers as subspaces, spanned by neurons), we introduce SVCCA, Singular Vector Canonical Correlation Analysis, an amalgamation of Singular Value Decomposition and Canonical Correlation Analysis (CCA) [5], as a powerful method for analyzing deep representations. Although CCA has not previously been used to compare deep representations, it has been used for related tasks such as computing the similarity between modeled and measured brain activity [16], and training multi-lingual word embeddings in language models [3].\n\nThe main contributions resulting from the introduction of SVCCA are the following:\n\n1. 
We ask: is the dimensionality of a layer's learned representation the same as the number of neurons in the layer? Answer: No. We show that trained networks perform equally well with a number of directions just a fraction of the number of neurons, with no additional training, provided they are carefully chosen with SVCCA (Section 2.1). We explore the consequences for model compression (Section 4.4).\n\n2. We ask: what do deep representation learning dynamics look like? Answer: Networks broadly converge bottom up. Using SVCCA, we compare layers across time and find they solidify from the bottom up. \n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n[Figure 1 panels: Neurons with highest activations / Top SVD directions / Top SVCCA directions, each for (net1, net2); x-axis: index over dataset; network layer widths 4-200-200-200-200-1.]\n\nFigure 1: To demonstrate SVCCA, we consider a toy regression task (regression target as in Figure 3). (a) We train two networks with four fully connected hidden layers starting from different random initializations, and examine the representation learned by the penultimate (shaded) layer in each network. (b) The neurons with the highest activations in net 1 (maroon) and in net 2 (green). The x-axis indexes over the dataset: in our formulation, the representation of a neuron is simply its value over a dataset (Section 2). (c) The SVD directions, i.e. the directions of maximal variance, for each network. (d) The top SVCCA directions. We see that each pair of maroon/green lines (starting from the top) is almost visually identical (up to a sign). Thus, although looking at just neurons (b) seems to indicate that the networks learn very different representations, looking at the SVCCA subspace (d) shows that the information in the representations is (up to a sign) nearly identical. 
This suggests a simple, computationally more efficient method of training networks, Freeze Training, where lower layers are sequentially frozen after a certain number of timesteps (Sections 4.1, 4.2).\n\n3. We develop a method based on the discrete Fourier transform which greatly speeds up the application of SVCCA to convolutional neural networks (Section 3).\n\n4. We also explore an interpretability question: when does an architecture become sensitive to different classes? We find that SVCCA captures the semantics of different classes, with similar classes having similar sensitivities, and vice versa (Section 4.3).\n\nExperimental Details Most of our experiments are performed on CIFAR-10 (augmented with random translations). The main architectures we use are a convolutional network and a residual network1. To produce a few figures, we also use a toy regression task: training a four hidden layer fully connected network with 1D input and 4D output, to regress on four different simple functions.\n\n2 Measuring Representations in Neural Networks\n\nOur goal in this paper is to analyze and interpret the representations learned by neural networks. The critical question from which our investigation departs is: how should we define the representation of a neuron? Consider that a neuron at a particular layer in a network computes a real-valued function over the network's input domain. In other words, if we had a lookup table of all possible input \u2192 output mappings for a neuron, it would be a complete portrayal of that neuron's functional form.\n\nHowever, such infinite tables are not only practically infeasible, but are also problematic to process into a set of conclusions. Our primary interest is not in the neuron's response to random data, but rather in how it represents features of a specific dataset (e.g. natural images). 
Therefore, in this study we take a neuron's representation to be its set of responses over a finite set of inputs, those drawn from some training or validation set.\n\nMore concretely, for a given dataset X = {x_1, ..., x_m} and a neuron i on layer l, we define z^l_i to be the vector of outputs on X, i.e.\n\nz^l_i = (z^l_i(x_1), ..., z^l_i(x_m))\n\n1Convnet layers: conv-conv-bn-pool-conv-conv-conv-bn-pool-fc-bn-fc-bn-out. Resnet layers: conv-(x10 c/bn/r block)-(x10 c/bn/r block)-(x10 c/bn/r block)-bn-fc-out.\n\nNote that this is a different vector from the often-considered vector of the \u201crepresentation at a layer of a single input.\u201d Here z^l_i is a single neuron's response over the entire dataset, not an entire layer's response for a single input. In this view, a neuron's representation can be thought of as a single vector in a high-dimensional space. Broadening our view from a single neuron to the collection of neurons in a layer, the layer can be thought of as the set of neuron vectors contained within that layer. This set of vectors will span some subspace. To summarize:\n\nConsidered over a dataset X with m examples, a neuron is a vector in R^m. A layer is the subspace of R^m spanned by its neurons' vectors.\n\nWithin this formalism, we introduce Singular Vector Canonical Correlation Analysis (SVCCA) as a method for analysing representations. SVCCA proceeds as follows:\n\n\u2022 Input: SVCCA takes as input two (not necessarily different) sets of neurons (typically layers of a network) l_1 = {z^{l_1}_1, ..., z^{l_1}_{m_1}} and l_2 = {z^{l_2}_1, ..., z^{l_2}_{m_2}}.\n\n\u2022 Step 1: First, SVCCA performs a singular value decomposition of each subspace to get sub-subspaces l'_1 \u2282 l_1, l'_2 \u2282 l_2 which comprise the most important directions of the original subspaces l_1, l_2. 
In general we take enough directions to explain 99% of the variance in the subspace. This is especially important in neural network representations, where, as we will show, many low variance directions (neurons) are primarily noise.\n\n\u2022 Step 2: Second, compute the Canonical Correlation similarity ([5]) of l'_1, l'_2: linearly transform l'_1, l'_2 to be as aligned as possible and compute correlation coefficients. In particular, given the output of step 1, l'_1 = {z'^{l_1}_1, ..., z'^{l_1}_{m'_1}} and l'_2 = {z'^{l_2}_1, ..., z'^{l_2}_{m'_2}}, CCA linearly transforms these subspaces, \u02dcl_1 = W_X l'_1 and \u02dcl_2 = W_Y l'_2, so as to maximize the correlations corrs = {\u03c1_1, ..., \u03c1_{min(m'_1, m'_2)}} between the transformed subspaces.\n\n\u2022 Output: With these steps, SVCCA outputs pairs of aligned directions, (\u02dcz^{l_1}_i, \u02dcz^{l_2}_i), and how well they correlate, \u03c1_i. Step 1 also produces intermediate output in the form of the top singular values and directions.\n\nFor a more detailed description of each step, see the Appendix. SVCCA can be used to analyse any two sets of neurons. In our experiments, we utilize this flexibility to compare representations across different random initializations, architectures, timesteps during training, and specific classes and layers.\n\nFigure 1 shows a simple, intuitive demonstration of SVCCA. We train a small network on a toy regression task and show each step of SVCCA, along with the resulting very similar representations. SVCCA is able to find hidden similarities in the representations.\n\n2.1 Distributed Representations\n\nAn important property of SVCCA is that it is truly a subspace method: both SVD and CCA work with span(z_1, ..., z_m) instead of being axis aligned to the z_i directions. 
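The two SVCCA steps described in Section 2 can be sketched numerically. The following is a minimal NumPy illustration (the function names and the exact rank-selection rule are our own choices, not taken from the paper); it computes the canonical correlations between the two SVD-reduced subspaces as the singular values of the product of their orthonormal bases:

```python
import numpy as np

def svcca(acts1, acts2, var_threshold=0.99):
    # acts1, acts2: (num_neurons, num_datapoints) matrices, one row per
    # neuron, following the neurons-as-vectors view of Section 2.
    def top_svd(acts):
        # Step 1: SVD; keep directions explaining var_threshold of variance.
        acts = acts - acts.mean(axis=1, keepdims=True)
        u, s, vt = np.linalg.svd(acts, full_matrices=False)
        frac = np.cumsum(s ** 2) / np.sum(s ** 2)
        k = int(np.searchsorted(frac, var_threshold)) + 1
        return np.diag(s[:k]) @ vt[:k]  # reduced representation over dataset

    def orthonormal_rows(x):
        u, _, vt = np.linalg.svd(x, full_matrices=False)
        return u @ vt  # same row span as x, but with orthonormal rows

    # Step 2: canonical correlations between the reduced subspaces are the
    # singular values of the product of their orthonormal bases (cosines of
    # the principal angles between the subspaces).
    w1 = orthonormal_rows(top_svd(acts1))
    w2 = orthonormal_rows(top_svd(acts2))
    return np.linalg.svd(w1 @ w2.T, compute_uv=False)
```

Because the centering, SVD, and orthonormalization steps absorb any invertible linear mixing plus offset of the neurons, this sketch is invariant to affine transforms of either layer, as required.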
SVD finds singular vectors z'_i = \u2211_{j=1}^m s_{ij} z_j, and the subsequent CCA finds a linear transform W, giving orthogonal canonically correlated directions {\u02dcz_1, ..., \u02dcz_m} = {\u2211_{j=1}^m w_{1j} z'_j, ..., \u2211_{j=1}^m w_{mj} z'_j}. In other words, SVCCA has no preference for representations that are neuron (axes) aligned.\n\nIf representations are distributed across many dimensions, then this is a desirable property of a representation analysis method. Previous studies have reported that representations may be more complex than either fully distributed or axis-aligned [17, 21, 11], but this question remains open.\n\nWe use SVCCA as a tool to probe the nature of representations via two experiments:\n\n(a) We find that the subspace directions found by SVCCA are disproportionately important to the representation learned by a layer, relative to neuron-aligned directions.\n\n(b) We show that at least some of these directions are distributed across many neurons.\n\nExperiments for (a), (b) are shown in Figure 2 as (a), (b) respectively. For both experiments, we first acquire two different representations, l_1, l_2, for a layer l by training two different random initializations of a convolutional network on CIFAR-10. We then apply SVCCA to l_1 and l_2 to get directions\n\nFigure 2: Demonstration of (a) disproportionate importance of SVCCA directions, and (b) distributed nature of some of these directions. For both panes, we first find the top k SVCCA directions by training two conv nets on CIFAR-10 and comparing corresponding layers. (a) We project the output of the top three layers, pool1, fc1, fc2, onto this top-k subspace. We see accuracy rises rapidly with increasing k, with even k \u226a num neurons giving reasonable performance, with no retraining. Baselines of random k neuron subspaces and max activation neurons require larger k to perform as well. 
(b) After projecting onto the top-k subspace (as in (a)), dotted lines then project again onto m neurons, chosen to correspond highly to the top-k SVCCA subspace. Many more neurons than k are needed for comparable performance, suggesting distributedness of the SVCCA directions.\n\n{\u02dcz^{l_1}_1, ..., \u02dcz^{l_1}_m} and {\u02dcz^{l_2}_1, ..., \u02dcz^{l_2}_m}, ordered according to importance by SVCCA, with each \u02dcz^{l_i}_j being a linear combination of the original neurons, i.e. \u02dcz^{l_i}_j = \u2211_{r=1}^m \u03b1^{(l_i)}_{jr} z^{l_i}_r.\n\nFor different values of k < m, we can then restrict layer l_i's output to lie in the subspace of span(\u02dcz^{l_i}_1, ..., \u02dcz^{l_i}_k), the most useful k-dimensional subspace as found by SVCCA, done by projecting each neuron into this k-dimensional space.\n\nWe find, somewhat surprisingly, that very few SVCCA directions are required for the network to perform the task well. As shown in Figure 2(a), for a network trained on CIFAR-10, the first 25 dimensions provide nearly the same accuracy as using all 512 dimensions of a fully connected layer with 512 neurons. The accuracy curve rises rapidly with the first few SVCCA directions, and plateaus quickly afterwards, for k \u226a m. This suggests that the useful information contained in m neurons is well summarized by the subspace formed by the top k SVCCA directions. Two baselines for comparison are picking random and maximum activation neuron aligned subspaces and projecting outputs onto these. Both of these baselines require far more directions (in this case: neurons) before matching the accuracy achieved by the SVCCA directions. These results also suggest approaches to model compression, which are explored in more detail in Section 4.4.\n\nFigure 2(b) next demonstrates that these useful SVCCA directions are at least somewhat distributed over neurons rather than axis-aligned. 
First, the top k SVCCA directions are picked and the representation is projected onto this subspace. Next, the representation is further projected onto m neurons, where the m are chosen as those most important to the SVCCA directions. The resulting accuracy is plotted for different choices of k (given by the x-axis) and different choices of m (different lines). For example, the fact that keeping even 100 fc1 neurons (dashed green line) cannot maintain the accuracy of the first 20 SVCCA directions (solid green line at x-axis 20) suggests that those 20 SVCCA directions are distributed across 5 or more neurons each, on average. Figure 3 shows a further demonstration of the effect on the output of projecting onto top SVCCA directions, here for the toy regression case.\n\nWhy the two step SV + CCA method is needed. Both SVD and CCA have important properties for analysing network representations, and SVCCA consequently benefits greatly from being a two step method. CCA is invariant to affine transformations, enabling comparisons without natural alignment (e.g. different architectures, Section 4.4). See Appendix B for proofs and a demonstrative figure. 
While CCA is a powerful method, it also suffers from certain shortcomings, particularly in determining how many directions were important to the original space X, which is the strength of SVD. See the Appendix for an example where naive CCA performs badly. Both the SVD and CCA steps are critical to the analysis of learning dynamics in Section 4.1.\n\nFigure 3: The effect on the output of a latent representation being projected onto top SVCCA directions in the toy regression task. Representations of the penultimate layer are projected onto the top 2, 6, 15, 30 SVCCA directions (from second pane). By 30, the output looks very similar to the full 200 neuron output (left).\n\n3 Scaling SVCCA for Convolutional Layers\n\nApplying SVCCA to convolutional layers can be done in two natural ways:\n\n(1) Same layer comparisons: If X, Y are the same layer (at different timesteps or across random initializations) receiving the same input, we can concatenate along the pixel (height h, width w) coordinates to form a vector: a conv layer of shape h \u00d7 w \u00d7 c maps to c vectors, each of dimension hwd, where d is the number of datapoints. This is a natural choice because neurons at different pixel coordinates see different image data patches from each other. When X, Y are two versions of the same layer, these c different views correspond perfectly.\n\n(2) Different layer comparisons: When X, Y are not the same layer, the image patches seen by different neurons have no natural correspondence. 
But we can flatten an h \u00d7 w \u00d7 c conv layer into hwc neurons, each of dimension d. This approach is valid for convs in different networks or at different depths.\n\n3.1 Scaling SVCCA with Discrete Fourier Transforms\n\nApplying SVCCA to convolutions introduces a computational challenge: the number of neurons (h \u00d7 w \u00d7 c) in convolutional layers, especially early ones, is very large, making SVCCA prohibitively expensive due to the large matrices involved. Luckily the problem of approximate dimensionality reduction of large matrices is well studied, and efficient algorithms exist, e.g. [4].\n\nFor convolutional layers, however, we can avoid dimensionality reduction and perform exact SVCCA, even for large networks. This is achieved by preprocessing each channel with a Discrete Fourier Transform (which preserves CCA due to invariances, see Appendix), causing all (covariance) matrices to be block-diagonal. This allows all matrix operations to be performed block by block, and only over the diagonal blocks, vastly reducing computation. We show:\n\nTheorem 1. Suppose we have a translation invariant (image) dataset X and convolutional layers l_1, l_2. Letting DFT(l_i) denote the discrete Fourier transform applied to each channel of l_i, the covariance cov(DFT(l_1), DFT(l_2)) is block diagonal, with blocks of size c \u00d7 c.\n\nWe make only two assumptions: 1) all layers below l_1, l_2 are either conv or pooling layers with circular boundary conditions (translation equivariance); 2) the dataset X has all translations of the images X_i. This is necessary in the proof for certain symmetries in neuron activations, but these symmetries typically exist in natural images even without translation invariance, as shown in Figure App.2 in the Appendix. Below are key statements, with proofs in the Appendix.\n\nDefinition 1. 
Say a single channel image dataset X of images is translation invariant if for any (wlog n \u00d7 n) image X_i \u2208 X, with pixel values {z_{11}, ..., z_{nn}}, the translated image X^{(a,b)}_i = {z_{\u03c3_a(1)\u03c3_b(1)}, ..., z_{\u03c3_a(n)\u03c3_b(n)}} is also in X, for all 0 \u2264 a, b \u2264 n \u2212 1, where \u03c3_a(i) = a + i mod n (and similarly for b).\n\nFor a multiple channel image X_i, an (a, b) translation is an (a, b) height/width shift on every channel separately. X is then translation invariant as above.\n\nTo prove Theorem 1, we first show another theorem:\n\nTheorem 2. Given a translation invariant dataset X, and a convolutional layer l with channels {c_1, ..., c_k} applied to X,\n\n(a) the DFT of c_i, F c_i F^T, has diagonal covariance matrix (with itself);\n\n(b) the DFTs of c_i, c_j, namely F c_i F^T, F c_j F^T, have diagonal covariance with each other.\n\nFinally, both of these theorems rely on properties of circulant matrices and their DFTs:\n\nLemma 1. The covariance matrix of c_i applied to translation invariant X is circulant and block circulant.\n\nLemma 2. The DFT of a circulant matrix is diagonal.\n\n4 Applications of SVCCA\n\n4.1 Learning Dynamics with SVCCA\n\nWe can use SVCCA as a window into learning dynamics by comparing the representation at a layer at different points during training to its final representation. 
Furthermore, as the SVCCA computations are relatively cheap to compute compared to methods that require training an auxiliary network for each comparison [1, 10, 11], we can compare all layers during training at all timesteps to all layers at the final time step, producing a rich view into the learning process.\n\nThe outputs of SVCCA are the aligned directions (\u02dcx_i, \u02dcy_i), how well they align, \u03c1_i, as well as intermediate output from the first step, of singular values and directions, \u03bb^{(i)}_X, x'^{(i)}, \u03bb^{(j)}_Y, y'^{(j)}. We condense these outputs into a single value, the SVCCA similarity \u00af\u03c1, that encapsulates how well the representations of two layers are aligned with each other,\n\n\u00af\u03c1 = (1 / min(m_1, m_2)) \u2211_i \u03c1_i,    (1)\n\nwhere min(m_1, m_2) is the size of the smaller of the two layers being compared. The SVCCA similarity \u00af\u03c1 is the average correlation across aligned directions, and is a direct multidimensional analogue of Pearson correlation.\n\nThe SVCCA similarity for all pairs of layers, and all time steps, is shown in Figure 4 for a convnet and a resnet architecture trained on CIFAR-10.\n\n4.2 Freeze Training\n\nObserving in Figure 4 that networks broadly converge from the bottom up, we propose a training method where we successively freeze lower layers during training, only updating higher and higher layers, saving all computation needed for deriving gradients and updating in lower layers.\n\nWe apply this method to convolutional and residual networks trained on CIFAR-10 (Figure 5), using a linear freezing regime: in the convolutional network, each layer is frozen at a fraction (layer number / total layers) of total training time, while for resnets, each residual block is frozen at a fraction (block number / total blocks). The vertical grey dotted lines show which steps have another set of layers frozen. 
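The linear freezing regime can be sketched as a simple schedule (a schematic with hypothetical helper names; real training code would additionally stop gradient computation and optimizer updates for the frozen layers):

```python
def freeze_schedule(num_layers, total_steps):
    # Layer i is frozen at fraction (i + 1) / num_layers of total training
    # time, so the lowest layer freezes first and the top layer trains to
    # the end, matching the linear regime described above.
    return {i: int((i + 1) / num_layers * total_steps)
            for i in range(num_layers)}

def trainable_layers(step, schedule):
    # Layers whose freeze step has not yet been reached still receive
    # gradient updates at this training step.
    return [i for i, freeze_step in schedule.items() if step < freeze_step]
```

For example, with 4 layers and 160,000 training steps, the lowest layer is frozen after step 40,000, the next after 80,000, and so on; at step 50,000 only the top three layers would still be updated.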
Aside from saving computation, Freeze Training appears to actively help generalization accuracy, like early stopping but with different layers requiring different stopping points.\n\n4.3 Interpreting Representations: when are classes learned?\n\nWe can also use SVCCA to compare how correlated representations in each layer are with the logits of each class, in order to measure how knowledge about the target evolves throughout the network.\n\nIn Figure 6 we apply the DFT CCA technique on the Imagenet Resnet [6]. We take five different classes and, for different layers in the network, compute the DFT CCA similarity between the logit of that class and the network layer. The results successfully reflect semantic aspects of the classes: the firetruck class sensitivity line is clearly distinct from the two pairs of dog breeds, and the network develops greater sensitivity to firetruck earlier on. The two pairs of dog breeds, purposefully chosen so that each pair is similar to the other in appearance, have CCA similarity lines that are very close to each other through the network, indicating these classes are similar to each other.\n\n[Figure 4 panels: columns show 0%, 35%, 75%, and 100% trained; rows show a convnet and a resnet on CIFAR-10; x-axis: layer (end of training); y-axis: layer (during training); color: weighted SVCCA scale.]\n\nFigure 4: Learning dynamics plots for conv (top) and res (bottom) nets trained on CIFAR-10. 
Each pane is a matrix of size layers \u00d7 layers, with each entry showing the SVCCA similarity \u00af\u03c1 between the two layers. Note that learning broadly happens \u2018bottom up\u2019: layers closer to the input seem to solidify into their final representations, with the exception of the very top layers. Per layer plots are included in the Appendix. Other patterns are also visible: batch norm layers maintain nearly perfect similarity to the layer preceding them due to scaling invariance (with a slight reduction since batch norm changes the SVD directions which capture 99% of the variance). In the resnet plot, we see a stripe-like pattern due to skip connections inducing high similarities to previous layers.\n\nFigure 5: Freeze Training reduces training cost and improves generalization. We apply Freeze Training to a convolutional network on CIFAR-10 and a residual network on CIFAR-10. As shown by the grey dotted lines (which indicate the timestep at which another layer is frozen), both networks have a \u2018linear\u2019 freezing regime: for the convolutional network, we freeze individual layers at evenly spaced timesteps throughout training. For the residual network, we freeze entire residual blocks at each freeze step. The curves were averaged over ten runs.\n\n4.4 Other Applications: Cross Model Comparison and Compression\n\nSVCCA similarity can also be used to compare the similarity of representations across different random initializations, and even different architectures. We compare convolutional networks on CIFAR-10 across random initializations (Appendix), and also a convolutional network to a residual network in Figure 7, using the DFT method described in Section 3.\n\nIn Figure 3, we saw that projecting onto the subspace of the top few SVCCA directions resulted in comparable accuracy. This observation motivates an approach to model compression. 
In particular, letting the output vector of layer l be x^{(l)} \u2208 R^{n\u00d71}, and the weights W^{(l)}, we replace the usual W^{(l)} x^{(l)} with (W^{(l)} P_x^T)(P_x x^{(l)}), where P_x is a k \u00d7 n projection matrix projecting x onto the top SVCCA directions. This bottleneck reduces both parameter count and inference computational cost for the layer by a factor \u223c k/n.\n\nFigure 6: We plot the CCA similarity using the Discrete Fourier Transform between the logits of five classes and layers in the Imagenet Resnet. The classes are firetruck and two pairs of dog breeds (terriers and husky-like dogs: husky and eskimo dog) that are chosen to be similar to each other. These semantic properties are captured in CCA similarity, where we see that the line corresponding to firetruck is clearly distinct from the two pairs of dog breeds, and the two lines in each pair are both very close to each other, reflecting the fact that each pair consists of visually similar looking images. Firetruck also appears to be easier for the network to learn, with greater sensitivity displayed much sooner.\n\nFigure 7: We plot the CCA similarity using the Discrete Fourier Transform between convolutional layers of a Resnet and Convnet trained on CIFAR-10. We find that the lower layers of both models are noticeably similar to each other, and they get progressively less similar as we compare higher layers. Note that the highest layers of the resnet are least similar to the lower layers of the convnet. 
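This bottleneck can be written out directly. The sketch below is a minimal NumPy illustration with hypothetical names; for simplicity, P here is a stand-in orthonormal projection rather than one computed from actual SVCCA directions:

```python
import numpy as np

def compress_layer(W, P):
    # W: (out_dim, n) weight matrix; P: (k, n) projection onto the top-k
    # directions. Replacing W @ x with (W @ P.T) @ (P @ x) stores only the
    # smaller (out_dim, k) matrix, with inputs first projected by P.
    return W @ P.T

# Hypothetical shapes: n = 512 neurons bottlenecked to k = 25 directions.
n, k, out_dim = 512, 25, 10
rng = np.random.RandomState(0)
W = rng.randn(out_dim, n)
P = np.linalg.qr(rng.randn(n, k))[0].T  # stand-in orthonormal (k, n) projection
x = rng.randn(n)

compressed = compress_layer(W, P) @ (P @ x)
uncompressed = W @ (P.T @ P @ x)  # same computation without factoring
assert np.allclose(compressed, uncompressed)
```

The factored form stores out_dim * k instead of out_dim * n weights, matching the ~k/n reduction in parameter count and inference cost described above.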
In Figure App.5 in the Appendix, we show that we can consecutively compress top layers with SVCCA by a significant amount (in one case reducing each layer to 0.35 of its original size) and hardly affect performance.\n\n5 Conclusion\n\nIn this paper we present SVCCA, a general method which allows for comparison of the learned distributed representations between different neural network layers and architectures. Using SVCCA we obtain novel insights into the learning dynamics and learned representations of common neural network architectures. These insights motivated a new Freeze Training technique which can reduce the number of flops required to train networks and potentially even increase generalization performance. We observe that CCA similarity can be a helpful tool for interpretability, with sensitivity to different classes reflecting their semantic properties. This technique also motivates a new algorithm for model compression. Finally, the \u201clower layers learn first\u201d behavior was also observed for recurrent neural networks, as shown in Figure App.6 in the Appendix.\n\nReferences\n\n[1] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.\n\n[2] David Eigen, Jason Rolfe, Rob Fergus, and Yann LeCun. Understanding deep architectures using a recursive convolutional network. arXiv preprint arXiv:1312.1847, 2013.\n\n[3] Manaal Faruqui and Chris Dyer. Improving vector space word representations using multilingual correlation. 
Association for Computational Linguistics, 2014.\n\n[4] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., 53:217\u2013288, 2011.\n\n[5] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16:2639\u20132664, 2004.\n\n[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.\n\n[7] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82\u201397, 2012.\n\n[8] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985.\n\n[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097\u20131105, 2012.\n\n[10] Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 991\u2013999, 2015.\n\n[11] Y. Li, J. Yosinski, J. Clune, H. Lipson, and J. Hopcroft. Convergent learning: Do different neural networks learn the same representations? In International Conference on Learning Representations (ICLR), May 2016.\n\n[12] Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John Hopcroft. Convergent learning: Do different neural networks learn the same representations? In Feature Extraction: Modern Questions and Challenges, pages 196\u2013212, 2015.\n\n[13] Aravindh Mahendran and Andrea Vedaldi. 
Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5188\u20135196, 2015.\n\n[14] Gr\u00e9goire Montavon, Mikio L. Braun, and Klaus-Robert M\u00fcller. Kernel analysis of deep networks. Journal of Machine Learning Research, 12(Sep):2563\u20132581, 2011.\n\n[15] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.\n\n[16] David Sussillo, Mark M. Churchland, Matthew T. Kaufman, and Krishna V. Shenoy. A neural network that finds a naturalistic solution for the production of muscle activity. Nature Neuroscience, 18(7):1025\u20131033, 2015.\n\n[17] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.\n\n[18] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.\n\n[19] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. In Deep Learning Workshop, International Conference on Machine Learning (ICML), 2015.\n\n[20] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818\u2013833. Springer, 2014.\n\n[21] Bolei Zhou, Aditya Khosla, \u00c0gata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene cnns. 
In International Conference on Learning Representations (ICLR), volume abs/1412.6856, 2014.\n\n", "award": [], "sourceid": 3088, "authors": [{"given_name": "Maithra", "family_name": "Raghu", "institution": "Cornell University and Google Brain"}, {"given_name": "Justin", "family_name": "Gilmer", "institution": "Google Brain"}, {"given_name": "Jason", "family_name": "Yosinski", "institution": "Uber"}, {"given_name": "Jascha", "family_name": "Sohl-Dickstein", "institution": "Google Brain"}]}