{"title": "Neural Networks Trained to Solve Differential Equations Learn General Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 4071, "page_last": 4081, "abstract": "We introduce a technique based on the singular vector canonical correlation analysis (SVCCA) for measuring the generality of neural network layers across a continuously-parametrized set of tasks. We illustrate this method by studying generality in neural networks trained to solve parametrized boundary value problems based on the Poisson partial differential equation. We find that the first hidden layers are general, and that they learn generalized coordinates over the input domain. Deeper layers are successively more specific. Next, we validate our method against an existing technique that measures layer generality using transfer learning experiments. We find excellent agreement between the two methods, and note that our method is much faster, particularly for continuously-parametrized problems. Finally, we also apply our method to networks trained on MNIST, and show it is consistent with, and complimentary to, another study of intrinsic dimensionality.", "full_text": "Neural Networks Trained to Solve Differential\n\nEquations Learn General Representations\n\nMartin Magill\n\nU. of Ontario Inst. of Tech.\nmartin.magill1@uoit.net\n\nFaisal Z. Qureshi\n\nU. of Ontario Inst. of Tech.\nfaisal.qureshi@uoit.ca\n\nHendrick W. de Haan\n\nU. of Ontario Inst. of Tech.\nhendrick.dehaan@uoit.ca\n\nAbstract\n\nWe introduce a technique based on the singular vector canonical correlation anal-\nysis (SVCCA) for measuring the generality of neural network layers across a\ncontinuously-parametrized set of tasks. We illustrate this method by studying gen-\nerality in neural networks trained to solve parametrized boundary value problems\nbased on the Poisson partial differential equation. 
We find that the first hidden layers are general, and that they learn generalized coordinates over the input domain. Deeper layers are successively more specific. Next, we validate our method against an existing technique that measures layer generality using transfer learning experiments. We find excellent agreement between the two methods, and note that our method is much faster, particularly for continuously-parametrized problems. Finally, we also apply our method to networks trained on MNIST, and show it is consistent with, and complementary to, another study of intrinsic dimensionality.

1 Introduction

Generality of a neural network layer indicates that it can be used successfully in neural networks trained on a variety of tasks [19]. Previously, Yosinski et al. [19] developed a method for measuring layer generality using transfer learning experiments, and used it to compare generality of layers between two image classification tasks. In this work, we will study the generality of layers across a continuously-parametrized set of tasks: a group of similar problems whose details are changed by varying a real number. We found the transfer learning method for measuring generality prohibitively expensive for this task. Instead, by relating generality to similarity, we develop a computationally efficient measure of generality that uses the singular vector canonical correlation analysis (SVCCA).

We demonstrate this method by measuring layer generality in neural networks trained to solve differential equations. We train fully-connected tanh neural networks (NNs) to solve Poisson's equation with a parametrized source term. The parameter of the source defines a family of related boundary value problems (BVPs), and we measure the generality of layers in the trained NNs as the parameter varies. We find the first layers to be general, and deeper layers to be progressively more specific.
Using the SVCCA, we are also able to visualize and interpret these general first layers.

We validate our approach by reproducing a subset of our results using the transfer learning experimental protocol of Yosinski et al. [19]. These very different methods produce consistent measurements of generality. Further, our technique is several orders of magnitude faster to compute. Finally, we apply our method to ReLU networks trained on the MNIST dataset [9], and compare to work by Li et al. [11]. We discuss how the two analyses differ, but confirm that our results are consistent with theirs.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

The main contributions of this work are:

1. We develop a method for efficiently computing layer generality over a continuously-parametrized family of tasks using the SVCCA.

2. Using this method, we demonstrate generality in the first layers of NNs trained to solve problems from a parametrized family of BVPs. We find that deeper layers become successively more specific to the problem parameter, and that network width can play an important role in determining layer generality.

3. We visualize the principal components of the first layers that were found to be general. We interpret them as generalized coordinates that reflect important subregions of the unit square.

4. We validate our method for measuring layer generality using the transfer learning experimental protocol developed by Yosinski et al. [19]. We find that both approaches identify the same trends in layer generality as network width is varied, but that our approach is significantly more computationally efficient, especially for continuously parametrized tasks.

5. We define a measure of the intrinsic dimensionality of a layer, and contrast it with that of Li et al.
[11]. We show the two are consistent for networks trained on the MNIST dataset.

1.1 Neural networks for differential equations

The idea to solve differential equations using neural networks was first proposed by Dissanayake and Phan-Thien [3]. They trained neural networks to minimize the loss function

$$L = \int_{\Omega} \|G[u](\mathbf{x})\|^2 \, dV + \int_{\partial\Omega} \|B[u](\mathbf{x})\|^2 \, dS, \qquad (1)$$

where G and B are differential operators on the domain Ω and its boundary ∂Ω respectively, G[u] = 0 is the differential equation, and B[u] = 0 describes boundary conditions. Training data consisted of coordinates x ∈ Ω sampled from a mesh, used to numerically approximate the integrals in L at each epoch. Similar methods were proposed by van Milligen et al. [17] and Lagaris et al. [7]. Many innovations have been made since, most of which were reviewed by Schmidhuber [15] and in a book by Yadav et al. [18]. Sirignano and Spiliopoulos [16] as well as Berg and Nyström [2] illustrated that the training points can be obtained by randomly sampling the domain rather than using a mesh, which significantly enhances performance in higher-dimensional problems. In fact, Sirignano and Spiliopoulos [16] and Han et al. [5] have demonstrated that neural networks can be used to solve partial differential equations in hundreds of dimensions, which is a revolutionary result. Traditionally, such problems have often been considered infeasible, since traditional mesh-based solvers suffer from an exponential growth in computational complexity with increasing problem dimensionality.

There are at least two good reasons for studying neural networks that solve differential equations (referred to hereafter as DENNs). The first is their unique advantages over traditional methods for solving differential equations [2–5, 7, 16, 17].
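To make Eq. (1) concrete, the sketch below (our own illustration, not code from any of the works cited here) estimates the loss functional by Monte-Carlo sampling for a one-dimensional analogue: G[u] = u″ − s on Ω = [0, 1], with boundary operator B[u] = u enforcing u = 0 at the two endpoints. The trial function u(x) = sin(πx) solves this BVP exactly when s(x) = −π² sin(πx), so its estimated loss is essentially zero:

```python
import numpy as np

def loss_estimate(u, u_xx, s, n_interior=10_000, seed=0):
    """Monte-Carlo estimate of L for a 1-D analogue of Eq. (1):
    the interior term integrates |u'' - s|^2 over [0, 1] at randomly
    sampled points (the mesh-free approach of [2, 16]); the boundary
    term is |u(0)|^2 + |u(1)|^2."""
    x = np.random.default_rng(seed).uniform(0.0, 1.0, n_interior)
    interior = np.mean((u_xx(x) - s(x)) ** 2)   # ~ integral of ||G[u]||^2 dV
    boundary = u(0.0) ** 2 + u(1.0) ** 2        # ~ integral of ||B[u]||^2 dS
    return interior + boundary

u    = lambda x: np.sin(np.pi * x)
u_xx = lambda x: -np.pi ** 2 * np.sin(np.pi * x)
s    = lambda x: -np.pi ** 2 * np.sin(np.pi * x)

loss_exact = loss_estimate(u, u_xx, s)                  # near zero
loss_wrong = loss_estimate(u, u_xx, lambda x: 0.0 * x)  # large residual
```

A training loop would minimize this estimate with respect to the network parameters, resampling the interior points at each step.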
The second is that they offer an opportunity to study the behaviour of neural networks in a well-understood context [2]. Most applications of neural networks, such as machine vision and natural language processing, involve solving problems that are ill-defined or have no known solutions. Conversely, there exists an enormous body of literature on differential equation problems, detailing when solutions exist, when they are unique, and how they will behave. Indeed, in some cases the exact solutions to the problem can be obtained analytically.

1.2 Studying the generality of features with transfer learning

Transfer learning is a major topic in machine learning, reviewed for instance by Pan and Yang [13]. Generally, transfer learning in neural networks entails initializing a recipient neural network using some of the weights from a donor neural network that was previously trained on a related task. Yosinski et al. [19] developed an experimental protocol for quantifying the generality of neural network layers using transfer learning experiments. They defined generality as the extent to which a layer from a network trained on some task A can be used for another task B. For instance, the first layers of CNNs trained on image data are known to be general: they always converge to the same features, namely Gabor filters (which detect edges) and color blobs (which detect colors) [6, 8, 10].

In the protocol developed by Yosinski et al. [19], the first n layers from a donor network trained on task A are used to initialize the first n layers of a recipient network. The remaining layers of the recipient are randomly initialized, and it is trained on task B. However, the transferred layers are frozen: they are not updated during training on task B.
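The freeze-and-transfer step can be sketched as follows (a minimal numpy mock-up on synthetic data, not the experimental code of Yosinski et al. [19]): the recipient's first layer is copied from the donor and simply excluded from the gradient updates while the remaining layer is trained on task B.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # synthetic "task B" inputs
y = np.tanh(X @ rng.normal(size=(3, 1)))      # synthetic "task B" targets

W1 = rng.normal(size=(3, 8)) * 0.5            # first layer, copied from donor
W2 = rng.normal(size=(8, 1)) * 0.5            # remaining layer, random init
W1_donor = W1.copy()                          # remember the transferred weights

losses = []
for _ in range(500):
    h = np.tanh(X @ W1)                       # frozen transferred layer
    err = h @ W2 - y
    losses.append(float(np.mean(err ** 2)))
    # Only W2 receives a gradient step; the frozen W1 is never updated.
    W2 -= 0.1 * (h.T @ err) / len(X)
```

After this loop, W1 is bit-for-bit identical to the donor's weights while the loss on task B has decreased; in the actual protocol this is repeated for each number n of transferred layers.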
The recipient is expected to perform as well on task B as did the donor on task A if and only if the transferred layers are general.

In practice, however, various other factors can impact the performance of the recipient on task B, so Yosinski et al. [19] also included three control tests. The first control is identical to the actual test, except that the recipient is trained on the original task A; this control identifies any fragile co-adaptation between consecutive layers [19]. The other two controls entail repeating the actual test and the first control, but allowing the transferred layers to be retrained. When the recipient is trained on task A with retraining, performance should return to that of the donor network. When it is trained on task B with retraining, Yosinski et al. [19] found the recipient actually outperformed the donor.

Yosinski et al. [19] successfully used their method to confirm the generality of the first layers of image-based CNNs. Further, they also discovered a previously unknown generality in their second layers. This methodology, however, was constructed for the binary comparison of two tasks A and B. In the present work, we are interested in studying layer generality across a continuously parametrized set of tasks (given by a family of BVPs), for which the transfer learning methodology is prohibitively computationally expensive. Instead, we will use a different approach, based on the SVCCA, which we will then validate against the method of Yosinski et al. [19] on a set of test cases.

1.3 SVCCA: Singular Vector Canonical Correlation Analysis

Yosinski et al. [19] defined generality of a layer to mean that it can be used successfully in networks performing a variety of tasks. This definition was motivated, however, by observing that the first layers of image-based CNNs converged upon similar features across many network architectures and applications.
We argue that these two concepts are related: if a certain representation leads to good performance across a variety of tasks, then well-trained networks learning any of those tasks will discover similar representations. In this spirit, we define a layer to be general across some group of tasks if similar layers are consistently learned by networks trained on any of those tasks. To use this definition to measure generality, then, we require a quantitative measure of layer similarity.

Recently, the SVCCA was demonstrated by Raghu et al. [14] to be a powerful method for measuring the similarity of neural network layers. The SVCCA considers the activation functions of a layer's neurons evaluated at points sampled throughout the network's input domain. In this way it incorporates problem-specific information, and as a result it outperforms older metrics of layer similarity that only consider the weights and biases of a layer. For instance, Li et al. [12] proposed measuring layer similarity by finding neuron permutations that maximized correlation between networks. As a linear algebraic algorithm, however, the SVCCA is more computationally efficient than permutation-based methods. Similarly, Berg and Nyström [2] have concurrently attempted to study the structure of DENNs by analyzing weights and biases directly, but found the results to be too sensitive to the local minima into which their networks converged.

Following Raghu et al. [14], we will use the SVCCA to define a scalar measure of the similarity of two layers. The SVCCA returns canonical directions in which two layers are maximally correlated. They defined the SVCCA similarity ρ of two layers as the average of these optimal correlation values. However, this quantity depends explicitly on the layers' widths, independently of the functions the layers represent.
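The computation underlying this similarity measure can be sketched in a few lines of numpy (an illustrative reimplementation under our own simplifying choices, such as a fixed 99% variance threshold, not the reference code of Raghu et al. [14]):

```python
import numpy as np

def svcca_correlations(A, B, keep=0.99):
    """Canonical correlations between two layers.

    A, B: (n_points, n_neurons) matrices of neuron activations evaluated
    at the same points sampled from the network's input domain."""
    def top_subspace(X):
        # SVD step: keep the singular vectors explaining `keep` of variance.
        X = X - X.mean(axis=0)
        U, S, _ = np.linalg.svd(X, full_matrices=False)
        r = int(np.searchsorted(np.cumsum(S**2) / np.sum(S**2), keep)) + 1
        return U[:, :r]          # orthonormal basis for the kept subspace
    Ua, Ub = top_subspace(A), top_subspace(B)
    # CCA step: between two orthonormal bases, the canonical correlations
    # are the singular values of Ua^T Ub.
    return np.linalg.svd(Ua.T @ Ub, compute_uv=False)

# Two "layers" representing the same 5-dimensional function space in
# different neuron coordinates: every canonical correlation should be ~1.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
A = X @ np.linalg.qr(rng.normal(size=(20, 5)))[0].T   # width-20 layer
B = X @ np.linalg.qr(rng.normal(size=(30, 5)))[0].T   # width-30 layer
rho = svcca_correlations(A, B)   # five correlations, each close to 1
```

Under a sum aggregation, the two layers above would register a similarity of about 5, i.e. five shared significant dimensions.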
Here, instead of the mean, we will define ρ as the sum of these correlations. Since we typically found that the majority of the correlations were nearly 1.0 or nearly 0.0, this SVCCA similarity roughly measures the number of significant dimensions shared by two layers. In particular, since the SVCCA between a layer and itself is equivalent to a principal component analysis, we will use the SVCCA self-similarity as an approximate measure of a layer's intrinsic dimensionality.

This concept of intrinsic dimensionality differs from that recently proposed by Li et al. [11]. They constrained network weights during training to a random d-dimensional subspace for various values of d, and defined the intrinsic dimensionality of a given network on a given task as the smallest d for which good performance is achieved. This metric differs from ours in two important ways. First, their algorithm finds the smallest representation required to solve a problem, whereas our definition directly analyses the actual representations learned in practice. Specifically, they consider a strongly regularized auxiliary problem, whereas we examine given solutions directly. Second, their measure is based on the performance of representations, whereas ours measures structure. Indeed, since Raghu et al. [14] were able to compress models with little loss of performance by keeping only the first few important SVCCA directions, the remaining important SVCCA directions must describe structures present in layers' representations that do not directly influence network performance. Our method is complementary to that of Li et al.
[11]: theirs finds compact solutions that perform well, which is of practical value, but ours can examine any given network without altering its properties.

2 Methodology

2.1 Problem definition

Following concurrent work by Berg and Nyström [2], we will study the structure of DENNs on a parametrized family of PDEs. Berg and Nyström [2] used a family of Poisson equations on a deformable domain. They attempted to characterize the properties of the DENN solutions by studying the variances of their weights and biases. However, they reported that their metrics of study were too sensitive to the local minima into which their solutions converged for them to draw conclusions [2]. In this work, we have repeated this experiment, but using the SVCCA as a more robust tool for studying the structure of the solutions. The family of PDEs considered here was

$$\nabla^2 u(x, y) = s(x, y) \quad \text{for } (x, y) \in \Omega, \qquad (2)$$
$$u(x, y) = 0 \quad \text{for } (x, y) \in \partial\Omega, \qquad (3)$$

where Ω = [−1, 1] × [−1, 1] is the domain and −s(x, y) is a nascent delta function given by

$$s(x, y) = -\delta_r(x, y; x', y') = -\frac{1}{2\pi r^2} \exp\left(-\frac{(x - x')^2 + (y - y')^2}{2r^2}\right), \qquad (4)$$

which satisfies lim_{r→0} δ_r(x, y; x′, y′) = δ(x − x′)δ(y − y′), where δ is the Dirac delta function. For the present work, we will fix y′ = 0 and r = 0.1, and vary only x′. Thus, the BVPs describe the electric potential produced by a localized charge distribution on a square domain with grounded edges. The problems are parametrized by x′. We relegate deformable domains to future work.

2.2 Implementation details

The networks used in this work were all fully-connected with 4 hidden layers of equal width, implemented in TensorFlow [1].
Activation functions were tanh, except in Section 3.4, where ReLU was used. Given inputs x and y, the network was trained to directly approximate u(x, y), the solution to a BVP from the family of BVPs described above. Training followed the DGM methodology of Sirignano and Spiliopoulos [16]. More implementation details are discussed in the supplemental material. Since this work was not focused on optimization of performance, we used relatively generic hyperparameters whenever possible to ensure that our results are reasonably general.

3 Results

3.1 Quantifying layer generality in DENNs using SVCCA

In this section, we use the SVCCA to study the generality of layers in DENNs trained to solve our family of BVPs. We train DENNs to solve the BVPs for a range of x′ values, each from four different random initializations per x′ value. We will refer to the different random initializations as the first through fourth random seeds for each x′ value (see the supplemental material for details about the random seed construction). First, we present results for networks of width 20. We condense our analysis into three metrics, and then study how those metrics vary with network width.

Figure 1 shows the SVCCA similarities computed between the first, third, and fourth hidden layers of networks of width 20. The matrix for the second hidden layer is omitted, but closely resembles that for the first hidden layer. The (i, j)th element of each matrix shows the SVCCA similarity computed between the given layers of the ith and jth networks in our dataset. Since the SVCCA similarity does not depend on the order in which the layers are compared, the matrices are symmetric.
The black grid lines of the matrices separate layers by the x′ values on which they were trained, and the four seeds for each x′ are grouped between the black grid lines.

The matrices evidently exhibit a lot of symmetry, and can be decomposed into subregions. The first is the diagonal of the matrices, which contains the self-similarities of the layers, denoted ρ^l_self in the lth layer. The second region contains the matrix elements that lie inside the block diagonal formed by the black grid lines, but that are off the main diagonal. These indicate the similarities between layers trained on the same x′ values, but from different random seeds, and will be denoted ρ^l_{Δx′=0}. The remaining matrix elements were found to be equivalent along the block-off-diagonals. These

Figure 1: Matrices of layer-wise SVCCA similarities between the first, third, and fourth hidden layers of networks of width 20 trained at various x′ values, with four random seeds per position. The black lines group layers on each axis by the x′ values at which they were trained. For each x′ value, the four entries correspond to four distinct random seeds. Thus the matrix diagonals contain self-similarities, the block diagonals formed by black lines contain similarities across random seeds at a fixed x′, and the remaining entries correspond to comparisons between distinct x′ values.

Figure 2: For each layer, crosses show mean similarities between distinct layers as a function of the difference in the x′ values at which they were trained. Diamonds show mean self-similarities. For both, error bars indicate maximum and minimum values.
The gray lines show the null hypothesis described in the text, namely that the representations are independent of x′.

correspond to all similarities computed between lth layers from networks trained on x′ values that differ by Δx′, which we will denote ρ^l_{Δx′}.

With this decomposition in mind, the matrices can be represented more succinctly as the plots shown in Figure 2. The diamonds and their error bars show the mean, minima, and maxima of ρ^l_self in each layer l. The crosses and their error bars show the means, minima, and maxima of ρ^l_{Δx′} for varying source-to-source distances Δx′. As described above, the statistics of ρ^l_{Δx′=0} were computed excluding the self-similarities ρ^l_self. The dashed gray lines show ⟨ρ^l_{Δx′=0}⟩ for each layer, and are used below to quantify specificity. We show the minima and maxima of the data in order to emphasize that our decomposition of the matrices in Figure 1 accurately reflects the structure of the data.

In the plots of Figure 2, the gap between ρ^l_self and ρ^l_{Δx′=0} indicates the extent to which different random initializations trained on the same value of x′ converge to the same representations. For this reason, we define the ratio ⟨ρ^l_{Δx′=0}⟩/⟨ρ^l_self⟩ as the reproducibility. It measures what fraction of a layer's intrinsic dimensionality is consistently reproduced across different random seeds. We see that, for networks of width 20, the first layer is highly reproducible, and the second is mostly reproducible. Conversely, the third and fourth layers in Figure 2 have a gap of roughly 3 out of 20 between ⟨ρ^l_{Δx′=0}⟩ and ⟨ρ^l_self⟩: networks from different random seeds at the same x′ value are consistently dissimilar in about 15% of their canonical components.

Figure 3: The intrinsic dimensionality, reproducibility, and specificity of the four layers at varying width. The lines indicate mean values. The error bars on intrinsic dimensionality indicate maxima and minima, whereas the error bars on reproducibility and specificity indicate estimated uncertainty on the means (discussed in the supplemental material). Numbers indicate layer numbers. The inset in (a) shows the limiting dimensionalities of the four layers at width 192.

We can use the plots of Figure 2 to quantify the generality of the layers. When a layer is general across x′, the similarity between layers should not depend on the x′ values at which they were trained. Thus the ρ^l_{Δx′} values should be distributed no differently than ρ^l_{Δx′=0}. Visually, when a layer is general, the crosses in Figure 2 should be within error of the dashed grey lines. Similarly, the distance between the crosses and the dashed line is proportional to the specificity of a layer. Thus we can see in Figure 2 that, for networks of width 20, the first and second layers appear to be general, whereas the third and fourth are progressively more specific.

To quantify this, we will define a layer's specificity as the average over Δx′ of |⟨ρ^l_{Δx′=0}⟩ − ⟨ρ^l_{Δx′}⟩| / ⟨ρ^l_{Δx′=0}⟩. In Figure 2, this is equivalent to the mean distance from the crosses to the dashed grey line, normalized by the height of the dashed grey line. Equivalently, it is the ratio of the area delimited by the crosses and the dashed line to the area under the dashed line. It can also be interpreted as a numerical estimation of the normalized L1 norm of the difference between the measured ⟨ρ^l_{Δx′=X}⟩ and the null hypothesis of a perfectly general layer. By this definition, a layer will have a specificity of 0 if and only if it has similar representations across all values of Δx′. Furthermore, the specificity is proportional to how much ⟨ρ^l_{Δx′}⟩ varies with Δx′. Thus the specificity metric we defined here is indeed consistent with the accepted definitions of generality and specificity.

The same experiments described above for networks of width 20 were repeated for widths of 8, 12, 16, 24, 48, 96, and 192. Figure 3 shows the measured intrinsic dimensionalities, reproducibilities, and specificities of the four layers. The error bars on the intrinsic dimensionalities show minima and maxima, emphasizing that these measurements were consistent across different values of x′ and different random seeds.
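To make the three summary metrics concrete, here is a small worked example with invented similarity values (the numbers are illustrative only, not measurements from our experiments):

```python
import numpy as np

# Illustrative SVCCA statistics for one layer l of a width-20 network.
rho_self = 20.0                  # mean self-similarity <rho^l_self>
rho_same = 17.0                  # <rho^l_{dx'=0}>: same x', different seeds
# <rho^l_{dx'}> for the nonzero source separations dx' = 0.1, ..., 0.6:
rho_dx = np.array([16.0, 15.0, 14.5, 14.0, 13.5, 13.0])

# Reproducibility: fraction of the intrinsic dimensionality that is
# consistently recovered across random seeds.
reproducibility = rho_same / rho_self

# Specificity: mean normalized L1 deviation of <rho^l_{dx'}> from the
# null hypothesis that similarity does not depend on dx'.
specificity = np.mean(np.abs(rho_same - rho_dx)) / rho_same

# A perfectly general layer (<rho^l_{dx'}> constant at <rho^l_{dx'=0}>)
# has specificity exactly 0.
general = np.mean(np.abs(rho_same - np.full(6, rho_same))) / rho_same
```

Here the layer reproduces 85% of its dimensions across seeds but loses similarity as Δx′ grows, giving a nonzero specificity of about 0.157.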
The error bars on the reproducibility and specificity show the estimated uncertainty on the means, as discussed in the supplemental material.

In narrow networks, the layers' intrinsic dimensionalities (Fig. 3(a)) equal the network width. As the network width increases, these dimensionalities drop below the width, and appear to converge to finite values. We suggest that, for a fixed x′ value, there are finite-dimensional representations to which the layers will consistently converge, so long as the networks are wide enough to support those representations. If the networks are too narrow, they converge to some smaller-dimensional projections of those representations. The reproducibility plots (Fig. 3(b)) support this interpretation, as the reproducibilities grow with network width. Furthermore, they are smaller for deeper layers, except in very wide networks where the fourth layer becomes more reproducible than the second and third. This could be related to convergence issues in very wide networks, as discussed below.

Figure 4: Plots of the first nine principal components of the first layer of a network trained on x′ = 0.6 (obtained by self-SVCCA). The numbers show the SVCCA correlations of each component.

The limiting dimensionalities increase nearly linearly with layer depth (⟨ρ^1_self⟩ ≈ 9, ⟨ρ^2_self⟩ ≈ 21, ⟨ρ^3_self⟩ ≈ 34, ⟨ρ^4_self⟩ ≈ 52), as shown in the inset of Figure 3(a).
This has implications for sparsification of DENNs, as Raghu et al. [14] showed successful sparsification by eliminating low-correlation components of the SVCCA. Similarly, DENN architectures that widen with depth may be more optimal than fixed-width architectures.

The specificity (Fig. 3(c)) varies more richly with network width. Overall, the first layer is most general, and successive layers are progressively more specific. Over small to medium widths, the second layer is nearly as general as the first layer; the third layer transitions from highly specific to quite general; and the fourth layer remains consistently specific. In very wide networks, however, the second and third layers appear to become more specific, whereas the fourth layer becomes somewhat more general. Future work should explore the behaviour at large widths, but we speculate that it may be related to changes in training dynamics at large widths. As discussed in the supplemental material, very wide networks seemed to experience very broad minima in the loss landscape, so our training protocol may have terminated before the layers converged to optimal and general representations.

The overall trends in specificity discussed above are interrupted near widths of 16, 20, and 24. All four layers appear somewhat more general than expected at width 16, and then more specific than expected at width 20. By width 24 and above they resume a more gradual variation with width. This is a surprising result that future work should explore more carefully. It occurs as the network width exceeds the limiting dimensionality of the second layer, which may play a role in this phenomenon.

3.2 Visualizing and interpreting the canonical directions

We have shown that the first layers of the DENNs studied here converge to general 9-dimensional representations independent of the parameter x′.
Figure 4 shows a visualization of the first 9 principal components (obtained by self-SVCCA) of the first layer of a network of width 192 trained at x′ = 0.6, shown as contour maps. We interpret these as generalized coordinates. The contours are densest where the corresponding coordinates are most sensitive. It is clear that the first 2 of these 9 components capture precisely the same information as x and y, but rotated. The remaining components act together to identify 9 regions of interest in the domain: the 4 corners, the 4 walls, and the center. For instance, component (e) describes the distance from the top and bottom walls; component (i) does the same for the left and right walls; and component (d) describes distance to the upper-left corner. We found the first layers could be interpreted this way at any x′, whether we found components by self-SVCCA or cross-SVCCA, and have included examples of this in the supplemental material. The components are always some linear combination of x, y, and the 9 regions described above.

Surprisingly, we found that the SVCCA was numerically unstable. Repeated analyses of the same networks produced slightly different components, although the correlation vectors were very stable. We see two factors contributing to this problem. First, the first 7 or 8 correlation values of the first layer are all extremely close to 1 and, therefore, to one another. Thus the task of sorting the corresponding components is inevitably ill-conditioned. Second, the components appear to be paired into subspaces, such as the first two in Figure 4. Thus the task of splitting these subspaces into one-dimensional components is also ill-conditioned. We propose that future work should explore component analyses that search for closely-coupled components.
This could resolve the numerical stability while also extracting even more structure about the layer representations.

3.3 Confirming generality by transfer learning experiments

In this section, we validate the method used to measure generality in Section 3.1 by repeating a subset of our measurements using the transfer learning technique established by Yosinski et al. [19]. We restricted our validation to a subset of cases because the transfer learning technique is significantly more computationally expensive. To this end, we only trained donor networks at x′_A = 0 and measured generality towards x′_B = 0.6. Following Yosinski et al. [19], we will call the control cases with x′_B = 0 the selffer cases, and the experimental cases with x′_B = 0.6 the transfer cases. We show the results for widths of 8, 16, 20, and 24 in Figure 5(a-d). Throughout this section, we will refer to the measure of layer specificity we defined in Section 3.1 as the SVCCA specificity, to distinguish it from the measure of layer specificity obtained from the transfer learning experiments, which we call the transfer specificity.

Figure 5: (a-d): Results of the transfer learning experiments conducted on networks of four different widths. Markers indicate means, and error bars indicate maxima and minima. At n = 0, the lines pass through the average of the two base cases. Donors were trained at x′_A = 0, and recipients at x′_B. (e): Measured transfer specificity as a function of network width. Numbers indicate layer number. Markers show the ratio of the mean losses, and error bars show the maximum and minimum ratios over all 16 combinations of the two losses. The dashed line shows a transfer specificity of 1.
In Figure 5(a-d), the transfer specificity is given by the difference between the losses of the frozen transfer group (solid, dark red points) and those of the frozen selffer group (solid, dark blue points). It is immediately clear that the third and fourth layers are much more specific than the first and second at all widths. The specificities of the first, second, and fourth layers do not change very much with width, whereas the third layer appears to become more general with increasing width. These results are in agreement with those found with the SVCCA specificity.

To quantify these differences, we define a transfer specificity metric given by the ratio of the losses between the two frozen groups. This is shown in Figure 5(e), and it can be compared to the SVCCA specificities for the same widths, which lie in the leftmost third of Figure 3(c). The dashed line in Figure 5(e) shows a transfer specificity of 1, corresponding to a perfectly general layer. The first and second layers have transfer specificities of roughly 5, and are general (within error) at all widths. The fourth layer, on the other hand, has a transfer specificity of roughly 10^5, and is highly specific at all widths. Whereas those layers' transfer specificities do not change significantly with width, the third layer becomes increasingly general as the width increases: its transfer specificity decreases by roughly a factor of 4, from roughly 64 at width 8 to 18 at width 16. In Figure 3(c), by comparison, its SVCCA specificity drops from roughly 5% at width 8 to 2% at width 16.
Thus the transfer specificity metric agrees with the main results of the SVCCA specificity: at all four widths, the first two layers are general, and the fourth is very specific; the third layer is specific, albeit much less so than the fourth, and becomes more general as the width increases.

Returning to Figure 5(a-d), recall that the remaining control groups also contain information about network structure. Any difference between the two selffer groups (the two blue series) indicates fragile co-adaptation. We note possible fragile co-adaptation at a width of 8, especially at n = 3. Future work should try measuring co-adaptation using the SVCCA, perhaps by measuring the similarities of different layers within the same network, as done by Raghu et al. [14]. Finally, a significant difference between the two retrained groups (the two dashed series) would have indicated that retraining transferred layers boosted recipient performance; however, we did not observe this in any of our cases.

Figure 6: Test accuracies, intrinsic dimensionalities and reproducibilities of networks trained on the MNIST dataset for various L2 regularization weights λ and network widths. The error bars in (a) and (b) show maxima and minima; those in (c) show the estimated standard error.

Overall, the transfer specificity used by Yosinski et al. [19] shows good agreement with the SVCCA specificity we defined.
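The transfer specificity metric described above is simple to compute once the final losses are in hand: the ratio of the mean losses of the two frozen groups, with error bars given by the extreme ratios over all pairwise combinations of the individual losses. A minimal sketch; the array names are assumptions:

```python
import numpy as np

def transfer_specificity(frozen_transfer_losses, frozen_selffer_losses):
    """Ratio of mean final losses of the frozen transfer group to those of
    the frozen selffer group, plus the minimum and maximum ratios over all
    pairwise combinations of individual losses (the error bars in Fig. 5(e)).
    """
    t = np.asarray(frozen_transfer_losses, dtype=float)
    s = np.asarray(frozen_selffer_losses, dtype=float)
    ratios = t[:, None] / s[None, :]   # all len(t) * len(s) combinations
    return t.mean() / s.mean(), ratios.min(), ratios.max()
```

A value near 1 indicates a general layer (transferring it costs nothing), while large values indicate specificity.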
We note, however, that the SVCCA specificity is much faster to compute. Both methods require training the original set of networks without transfer learning, which took about 2 hours per network using our methodology and hardware. We could then compute all the SVCCA specificities for this work in roughly 15 minutes. On the other hand, Figure 5 required hundreds of extra hours to compute, and only considers four widths and two x′ values. That method would be prohibitively expensive for measuring generality in any continuously-parametrized problem.

3.4 Intrinsic dimensionality and reproducibility on MNIST

We also applied our metrics to the same networks trained instead on the MNIST dataset [9], with ReLU activation functions rather than tanh. The networks were trained to minimize the classification cross entropy plus an L2 regularization term with weight λ. We used widths of 50, 100, 200, and 400; λ values of 0, 0.01, and 0.05; and four random seeds per combination. Li et al. [11] measured the intrinsic dimensionalities of such networks, and found them to be vastly overparametrized.

Figure 6(a) shows, for each λ, the range of test accuracies over all widths after training on 2000 batches of 100 images. Figures 6(b) and 6(c) show the intrinsic dimensionalities and reproducibilities, respectively, by width, layer number, and λ. Without regularization (i.e. with λ = 0), the intrinsic dimensionalities of all four layers are nearly equal to their widths. This apparent contradiction of Li et al. [11] arises because their method is itself strongly regularizing. As we increase λ, the intrinsic dimensionalities decrease more rapidly than performance, which is consistent with the results of Li et al. [11]. We find low reproducibility in all experiments, even though the accuracies and intrinsic dimensionalities are quite consistent across seeds.
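The training objective described above, classification cross entropy plus an L2 penalty with weight λ, can be sketched generically in NumPy. The function name, argument layout, and the exact form of the penalty (λ times the unhalved sum of squared weights) are assumptions, not the authors' exact implementation:

```python
import numpy as np

def regularized_loss(logits, labels, weights, lam):
    """Mean classification cross entropy plus lam * sum of squared weights.

    logits:  array of shape (batch, classes)
    labels:  integer class indices, shape (batch,)
    weights: list of weight arrays to penalize
    """
    z = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    cross_entropy = -log_probs[np.arange(len(labels)), labels].mean()
    l2_penalty = lam * sum(float((w ** 2).sum()) for w in weights)
    return cross_entropy + l2_penalty
```

With λ = 0 this reduces to the plain cross entropy, matching the unregularized runs in Figure 6.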
The low reproducibility suggests that, at fixed width and regularization strength, although the networks consistently converge to representations of the same dimension, and although they exhibit comparable accuracy, the details of the learned representations vary significantly across seeds. In other words, the optimal representations in this experiment are non-unique. Since our metrics can be computed efficiently, future work should explore how this conclusion evolves during training. This experiment illustrates how our first two metrics, developed for DENNs, can also be applied more broadly. Our metric of specificity is based on a continuously-changing task; extending it to MNIST could be done, for instance, by varying the relative sampling of the target classes.

4 Conclusion

In this paper, we presented a method for measuring layer generality over a continuously-parametrized set of problems using the SVCCA. Using this method, we studied the generality of layers in DENNs over a parametrized family of BVPs. We found that the first layer is general; the second is somewhat less so; the third is general in wide networks but specific in narrow ones; and the fourth is specific for widths up to 192. We visualized the general components identified in the first layers and interpreted them as generalized coordinates capturing features of interest in the input domain. We validated our method against the transfer learning protocol of Yosinski et al. [19]. The two methods show good agreement, but ours is much faster, especially on continuously-parametrized problems. Finally, we contrasted our intrinsic dimensionality with that used by Li et al. [11].
The two are distinct but complementary, and produce consistent results for networks trained on the MNIST dataset [9].

Acknowledgements

MM gratefully acknowledges funding from the Ontario Graduate Scholarship (OGS). FZQ gratefully acknowledges funding from the Natural Sciences and Engineering Research Council (NSERC) in the form of Discovery Grant 2015-04533. HWdH gratefully acknowledges funding from the Natural Sciences and Engineering Research Council (NSERC) in the form of Discovery Grant 2014-06091.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

[2] Jens Berg and Kaj Nyström. A unified deep artificial neural network approach to partial differential equations in complex geometries. arXiv preprint arXiv:1711.06464, 2017.

[3] M. W. M. G. Dissanayake and N. Phan-Thien. Neural-network-based approximations for solving partial differential equations. Communications in Numerical Methods in Engineering, 10(3):195–201, 1994. doi: 10.1002/cnm.1640100303.
URL https://onlinelibrary.wiley.com/doi/abs/10.1002/cnm.1640100303.

[4] Philipp Grohs, Fabian Hornung, Arnulf Jentzen, and Philippe von Wurstemberger. A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. arXiv preprint arXiv:1809.02362, 2018.

[5] Jiequn Han, Arnulf Jentzen, and E Weinan. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.

[6] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[7] Isaac E Lagaris, Aristidis Likas, and Dimitrios I Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. IEEE Transactions on Neural Networks, 9(5):987–1000, 1998.

[8] Quoc V Le, Alexandre Karpenko, Jiquan Ngiam, and Andrew Y Ng. ICA with reconstruction cost for efficient overcomplete feature learning. In Advances in Neural Information Processing Systems, pages 1017–1025, 2011.

[9] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

[10] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 609–616. ACM, 2009.

[11] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. arXiv preprint arXiv:1804.08838, 2018.

[12] Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John Hopcroft. Convergent learning: Do different neural networks learn the same representations?
In Feature Extraction: Modern Questions and Challenges, pages 196–212, 2015.

[13] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.

[14] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pages 6078–6087, 2017.

[15] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.

[16] Justin Sirignano and Konstantinos Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. arXiv preprint arXiv:1708.07469, 2017.

[17] B Ph van Milligen, V Tribaldos, and J A Jiménez. Neural network differential equation and plasma equilibrium solver. Physical Review Letters, 75(20):3594, 1995.

[18] Neha Yadav, Anupam Yadav, and Manoj Kumar. An introduction to neural network methods for differential equations. Springer, 2015.

[19] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.