{"title": "Visualizing the PHATE of Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1842, "page_last": 1853, "abstract": "Understanding why and how certain neural networks outperform others is key to guiding future development of network architectures and optimization methods. To this end, we introduce a novel visualization algorithm that reveals the internal geometry of such networks: Multislice PHATE (M-PHATE), the first method designed explicitly to visualize how a neural network's hidden representations of data evolve throughout the course of training. We demonstrate that our visualization provides intuitive, detailed summaries of the learning dynamics beyond simple global measures (i.e., validation loss and accuracy), without the need to access validation data. Furthermore, M-PHATE better captures both the dynamics and community structure of the hidden units as compared to visualization based on standard dimensionality reduction methods (e.g., ISOMAP, t-SNE). We demonstrate M-PHATE with two vignettes: continual learning and generalization. In the former, the M-PHATE visualizations display the mechanism of \"catastrophic forgetting\" which is a major challenge for learning in task-switching contexts. In the latter, our visualizations reveal how increased heterogeneity among hidden units correlates with improved generalization performance. An implementation of M-PHATE, along with scripts to reproduce the figures in this paper, is available at https://github.com/scottgigante/M-PHATE.", "full_text": "Visualizing the PHATE of Neural Networks\n\nComp. Biol. and Bioinf. Program\n\nPrinceton Neuroscience Institute\n\nAdam S. Charles\n\nPrinceton University\nPrinceton, NJ, 08544\n\nadamsc@princeton.edu\n\nScott Gigante\n\nYale University\n\nNew Haven, CT 06511\n\nscott.gigante@yale.edu\n\nSmita Krishnaswamy\n\nDepts. 
of Genetics and Computer Science\n\nYale University\n\nNew Haven, CT 06520\n\nsmita.krishnaswamy@yale.edu\n\nGal Mishne\n\nHalıcıoğlu Data Science Institute\n\nUniversity of California, San Diego\n\nLa Jolla, CA 92093\n\ngmishne@ucsd.edu\n\nAbstract\n\nUnderstanding why and how certain neural networks outperform others is key to guiding future development of network architectures and optimization methods. To this end, we introduce a novel visualization algorithm that reveals the internal geometry of such networks: Multislice PHATE (M-PHATE), the first method designed explicitly to visualize how a neural network's hidden representations of data evolve throughout the course of training. We demonstrate that our visualization provides intuitive, detailed summaries of the learning dynamics beyond simple global measures (i.e., validation loss and accuracy), without the need to access validation data. Furthermore, M-PHATE better captures both the dynamics and community structure of the hidden units as compared to visualization based on standard dimensionality reduction methods (e.g., ISOMAP, t-SNE). We demonstrate M-PHATE with two vignettes: continual learning and generalization. In the former, the M-PHATE visualizations display the mechanism of \u201ccatastrophic forgetting\u201d which is a major challenge for learning in task-switching contexts. In the latter, our visualizations reveal how increased heterogeneity among hidden units correlates with improved generalization performance. An implementation of M-PHATE, along with scripts to reproduce the figures in this paper, is available at https://github.com/scottgigante/M-PHATE.\n\n1 Introduction\n\nDespite their massive increase in popularity in recent years, deep networks are still regarded as opaque and difficult to interpret or analyze. Understanding how and why certain neural networks perform better than others remains an art. 
The design and training of neural networks (the choice of architectures, regularization, activation functions, and hyperparameters), while informed by theory and prior work, are often driven by intuition and tuned manually [1]. The combination of these intuition-driven selections and long training times even on high-performance hardware (e.g., 3 weeks on 8 GPUs for the popular ResNet-200 network for image classification) means that the combinatorial task of testing all possible choices is impossible, and must be guided by more principled evaluations and explorations.\n\nA natural and widely used measure of evaluation for the difference between network architectures and optimizers is the validation loss. In some situations, the validation loss lacks a clearly defined global meaning, e.g., when the loss function itself is learned, and other evaluations are required [2, 3].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fWhile such scores are useful for ranking models on the basis of performance, they crucially do not explain why one model outperforms another. To provide additional insight, visualization tools have been employed, for example to analyze the “loss landscape” of a network. Specifically, these visualizations depict how architectural choices modify the smoothness of local minima [4, 5] — a quality assumed to be related to generalization abilities.\n\nLocal minima smoothness, however, is only one possible correlate of performance. Another internal quality that can be quantified is the hidden representation of inputs provided by the hidden unit activations. The multi-layered hidden representations of data are, in effect, the single most important feature distinguishing neural networks from classical machine learning techniques in generalization [6–10]. 
We can view the changes in representation induced by stochastic gradient descent as a dynamical system evolving from its random initialization to a converged low-energy state. Observing the progression of this dynamical system gives more insight into the learning process than observing it at a single point in time (e.g., after convergence). In this paper, we contribute a novel method of inspecting a neural network's learning: we visualize the evolution of the network's hidden representation during training to isolate key qualities predictive of improved network performance.\n\nAnalyzing extremely high-dimensional objects such as deep neural networks requires methods that can reduce these large structures into more manageable representations that are efficient to manipulate and visualize. Dimensionality reduction is a class of machine learning techniques that aim to reduce the number of variables under consideration in high-dimensional data while maintaining the structure of the dataset. A wide array of dimensionality reduction techniques are designed specifically for visualization, aiming to capture the structure of a dataset in two or three dimensions for human interpretation, e.g., MDS [11], t-SNE [12], and Isomap [13]. In this paper, we employ PHATE [14], a kernel-based dimensionality reduction method designed for visualization, which uses multidimensional scaling (MDS) [11] to effectively embed the diffusion geometry [15] of a dataset in two or three dimensions.\n\nIn order to visualize the evolution of the network's hidden representation, we take advantage of the longitudinal nature of the data: we have, in effect, many observations of an evolving dynamical system, which lends itself well to building a graph from the data connecting observations across different points in time. 
We construct a weighted multislice graph (where a “slice” refers to the network state at a fixed point in time) by creating connections between hidden representations obtained from a single unit across multiple epochs, and from multiple units within the same epoch. A pairwise affinity kernel on this graph reflects the similarity between hidden units and their evolution over time. This kernel is then dimensionality-reduced with PHATE and visualized in two dimensions.\n\nThe main contributions of this paper are as follows. We present Multislice PHATE (M-PHATE), which combines a novel multislice kernel construction with the PHATE visualization [14]. Our kernel captures the dynamics of an evolving graph structure that, when visualized, gives unique intuition about the evolution of a neural network over the course of training and re-training. We compare M-PHATE to other dimensionality reduction techniques, showing that the combined construction of the multislice kernel and the use of PHATE provide significant improvements to visualization. In two vignettes, we demonstrate the use of M-PHATE on established training tasks and learning methods in continual learning, and on regularization techniques commonly used to improve generalization performance. These examples draw insight into the reasons certain methods and architectures outperform others, and demonstrate how visualizing the hidden units of a network with M-PHATE provides additional information to a deep learning practitioner over classical metrics such as validation loss and accuracy, all without the need to access validation data.\n\n2 Background\n\nDiffusion maps (DMs) [15] is an important nonlinear dimensionality reduction method that has been used to extract complex relationships between high-dimensional data [16–22]. PHATE [14] aims to optimize diffusion maps for data visualization. 
We briefly review the two approaches.\n\nGiven a high-dimensional dataset {x_i}, DMs operate on a pairwise similarity matrix W (e.g., computed via a Gaussian kernel W(x_i, x_j) = exp{−‖x_i − x_j‖² / ε}) and return an embedding of the data in a low-dimensional Euclidean space. To compute this embedding, the rows of W are normalized by P = D⁻¹W, where D_ii = Σ_j W_ij. The resulting matrix P can be interpreted as the transition matrix of a Markov chain over the dataset, and powers of the matrix, P^t, represent running the Markov chain forward t steps. The matrix P thus has a complete sequence of bi-orthogonal left and right eigenvectors φ_i, ψ_i, respectively, and a corresponding sequence of eigenvalues 1 = λ_0 ≥ |λ_1| ≥ |λ_2| ≥ . . .. Due to the fast decay of the spectrum {λ_ℓ}, we can obtain a low-dimensional representation of the data using only the top ℓ eigenvectors. The diffusion map, defined as Ψ_t(x) = (λ_1^t ψ_1(x), λ_2^t ψ_2(x), . . . , λ_ℓ^t ψ_ℓ(x)), embeds the data points into a Euclidean space R^ℓ in which the Euclidean distance approximates the diffusion distance:\n\nD_t²(x_i, x_j) = Σ_{x_k} (p_t(x_i, x_k) − p_t(x_j, x_k))² / φ_0(x_k) ≈ ‖Ψ_t(x_i) − Ψ_t(x_j)‖_2²\n\nNote that ψ_0 is neglected because it is a constant vector.\n\nTo enable successful data visualization, a method must reduce the dimensionality to two or three dimensions; diffusion maps, however, reduces only to the intrinsic dimensionality of the data, which may be much higher. 
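The diffusion map construction above can be sketched in a few lines of NumPy. This is a minimal illustration under our own naming and default parameters, not code from the paper's implementation:

```python
import numpy as np

def diffusion_map(X, eps=1.0, t=1, n_components=2):
    """Minimal diffusion map sketch. X is an (n_points, n_features) array."""
    # Gaussian affinity W(x_i, x_j) = exp(-||x_i - x_j||^2 / eps)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / eps)
    # Row-normalize to the Markov transition matrix P = D^{-1} W
    P = W / W.sum(axis=1, keepdims=True)
    # Right eigenvectors of P, sorted by eigenvalue magnitude (lambda_0 = 1)
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-np.abs(vals))
    vals, vecs = np.real(vals[order]), np.real(vecs[:, order])
    # Drop the trivial constant eigenvector psi_0 and weight by lambda^t
    return (vals[1:n_components + 1] ** t) * vecs[:, 1:n_components + 1]

emb = diffusion_map(np.random.default_rng(0).normal(size=(30, 5)))
print(emb.shape)  # (30, 2)
```

In practice one would use a sparse k-nearest-neighbor affinity and a partial eigendecomposition; the dense version above is only meant to make the P = D⁻¹W construction and the eigenvector weighting concrete.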
Thus, to calculate a 2D or 3D representation of the data, PHATE applies MDS [11] to the informational distance between rows i and j of the diffusion kernel P^t, defined as\n\nΦ_t(i, j) = ‖ log P^t(i) − log P^t(j) ‖_2\n\nwhere t is selected automatically as the knee point of the von Neumann entropy of the diffusion operator. For further details, see Moon et al. [14].\n\n2.1 Related work\n\nWe consider the evolving state of a neural network's hidden units as a dynamical system which can be represented as a multislice graph on which we construct a pairwise affinity kernel. Such a kernel considers both similarities between hidden units in the same epoch or time-slice (denoted intraslice similarities) and similarities of a hidden unit to itself across different time-slices (denoted interslice similarities). The concept of constructing a graph for data changing over time is motivated by prior work both in harmonic analysis [20, 23–25, 22] and network science [26]. For example, Coifman and Hirn [20] suggest an algorithm for jointly analyzing DMs built over data points that are changing over time by aligning the separately constructed DMs, while Mucha et al. [26] suggest an algorithm for community detection in multislice networks by connecting each node in one network slice to itself in other slices, with identical fixed weights for all interslice connections. In both cases, such techniques are designed to detect changes in intraslice dynamics over time, yet interslice dynamics are not incorporated into the model.\n\n3 Multislice PHATE\n\n3.1 Preliminaries\n\nLet F be a neural network with a total of m hidden units applied to d-dimensional input data. Let F_i : R^d → R be the activation of the ith hidden unit of F, and F(τ) be the representation of the network after being trained for τ ∈ {1, . . . 
, n} epochs on training data X sampled from a dataset X. A natural feature space for the hidden units of F is the activations of the units with respect to the input data. Let Y ⊂ X be a representative sample of p ≪ |X| points. (In this paper, we use points not used in training; however, this is not necessary. Further discussion is given in Section S2.) Let Y_k be the kth sample in Y. We use the hidden unit activations F(Y) to compute a shared feature space of dimension p for the hidden units. We can then calculate similarities between units from all layers. Note that one may instead consider the hidden units' learned parameters (e.g., weight matrices and bias terms); however, these are not suitable for our purposes, as they are not necessarily the same shape between hidden layers, and additionally the parameters may contain information not relevant to the data (for example, in dimensions of X containing no relevant information).\n\nWe denote the time trace T of the network as an n × m × p tensor containing the activations at each epoch τ of each hidden unit F_i with respect to each sample Y_k ∈ Y. We note that in practice, the major driver of variation in T is the bias term contributing a fixed value to the activation of each hidden unit. Further, we note that the absolute values of the differences in activation of a hidden unit are not strictly meaningful, since any differences in activation can simply be magnified by a larger kernel weight in the following layer. 
Therefore, to calculate more meaningful similarities, we first z-score the activations of each hidden unit at each epoch τ:\n\nT(τ, i, k) = (F_i^(τ)(Y_k) − (1/p) Σ_ℓ F_i^(τ)(Y_ℓ)) / √(Var_ℓ F_i^(τ)(Y_ℓ)).\n\n3.2 Multislice Kernel\n\nThe time trace gives us a natural substrate from which to construct a visualization of the network's evolution. We construct a kernel over T, utilizing our prior knowledge of the temporal aspect of T to capture its dynamics. Let K be an nm × nm kernel matrix between all hidden units at all epochs (the (τm + j)th row or column of K refers to the jth unit at epoch τ). We henceforth refer to the (τm + j)th row of K as K((τ, j), :) and the (τm + j)th column of K as K(:, (τ, j)).\n\nTo capture both the evolution of a hidden unit throughout training as well as its community structure with respect to other hidden units, we construct a multislice kernel matrix which reflects both affinities between hidden units i and j in the same epoch τ, or intraslice affinities\n\nK_intraslice^(τ)(i, j) = exp(−‖T(τ, i) − T(τ, j)‖_2^α / σ_(τ,i)^α)\n\nas well as affinities between a hidden unit i and itself at different epochs, or interslice affinities\n\nK_interslice^(i)(τ, υ) = exp(−‖T(τ, i) − T(υ, i)‖_2² / ε²)\n\nwhere σ_(τ,i) is the adaptive intraslice bandwidth for unit i at epoch τ, ε is the fixed interslice bandwidth, and α is the adaptive bandwidth decay parameter.\n\nIn order to maintain connectivity while increasing robustness to parameter selection for the intraslice affinities K_intraslice^(τ), we use an adaptive-bandwidth Gaussian kernel (termed the alpha-decay kernel [14]), with bandwidth 
σ_(τ,i) set to be the distance of unit i at epoch τ to its kth nearest neighbor across units at that epoch: σ_(τ,i) = d_k(T(τ, i), T(τ, :)), where d_k(x, X) denotes the L2 distance from x to its kth nearest neighbor in X. Note that the use of the adaptive bandwidth means that the kernel is not symmetric and will require symmetrization. In order to allow the kernel to represent changing dynamics of units over the course of learning, we use a fixed-bandwidth Gaussian kernel in the interslice affinities K_interslice^(i), where ε is the average across all epochs and all units of the distance of unit i at epoch τ to its κth nearest neighbor among the set consisting of the same unit i at all other epochs: ε = (1/nm) Σ_{τ=1}^n Σ_{i=1}^m d_κ(T(τ, i), T(:, i)).\n\nFinally, the multislice kernel matrix contains one row and column for each unit at each epoch, such that the intraslice affinities form a block diagonal matrix and the interslice affinities form off-diagonal blocks composed of diagonal matrices (see Figures S1 and S2 for a diagram):\n\nK((τ, i), (υ, j)) = K_intraslice^(τ)(i, j) if τ = υ; K_interslice^(i)(τ, υ) if i = j; 0 otherwise.\n\nWe symmetrize this kernel as K′ = (1/2)(K + K^T), and row-normalize it to obtain P = D⁻¹K′, which represents a random walk over all units across all epochs, where propagating from (τ, i) to (υ, j) is conditional on the transition probabilities between epochs τ and υ. PHATE [14] is applied to P to visualize the time trace T in two or three dimensions.\n\n4 Results\n\n4.1 Example visualization\n\nTo demonstrate our visualization, we train a feedforward neural network with 3 layers of 64 hidden units to classify digits in MNIST [27]. 
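The multislice kernel assembly of Section 3.2 can be sketched as follows. This is a simplified sketch with our own function name and loop structure, not the reference implementation; it assumes the time trace T is already z-scored, and the parameters k, κ (kappa), and α follow the text:

```python
import numpy as np

def multislice_kernel(T, k=5, kappa=2, alpha=2.0):
    """Simplified multislice kernel sketch. T has shape
    (n_epochs, m_units, p_samples) and is assumed already z-scored."""
    n, m, _ = T.shape
    K = np.zeros((n * m, n * m))
    # Fixed interslice bandwidth eps: average distance of a unit to its
    # kappa-th nearest neighbor among its own activations at other epochs
    eps = 0.0
    for i in range(m):
        D = np.linalg.norm(T[:, None, i] - T[None, :, i], axis=-1)
        eps += np.sort(D, axis=1)[:, kappa].mean()
    eps /= m
    for tau in range(n):
        # Intraslice block: adaptive-bandwidth alpha-decay kernel at epoch tau
        D = np.linalg.norm(T[tau, :, None] - T[tau, None, :], axis=-1)
        sigma = np.sort(D, axis=1)[:, k]  # distance to k-th nearest neighbor
        K[tau * m:(tau + 1) * m, tau * m:(tau + 1) * m] = \
            np.exp(-(D / sigma[:, None]) ** alpha)
        # Interslice entries: fixed-bandwidth Gaussian, unit i to itself only
        for ups in range(n):
            if ups != tau:
                d = np.linalg.norm(T[tau] - T[ups], axis=-1)
                block = K[tau * m:(tau + 1) * m, ups * m:(ups + 1) * m]
                block[np.diag_indices(m)] = np.exp(-d ** 2 / eps ** 2)
    K = (K + K.T) / 2                        # symmetrize
    return K / K.sum(axis=1, keepdims=True)  # row-normalize to a random walk
```

In the full method, PHATE is then applied to the resulting row-stochastic matrix to produce the 2D embedding; the released package at https://github.com/scottgigante/M-PHATE wraps this whole pipeline.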
The visualization is built on the time trace T evaluated on the network over a single round of training that lasted 300 epochs and reached 96% validation accuracy.\n\n\fFigure 1: Visualization of a simple 3-layer MLP trained on MNIST with M-PHATE. Visualization is colored by epoch (left), hidden layer (center), and most active digit for each unit (right).\n\nWe visualize the network using M-PHATE (Fig. 1), colored by epoch, hidden layer, and the digit whose examples most strongly activate each hidden unit. The embedding is clearly organized longitudinally by epoch, with larger jumps between early epochs and gradually smaller steps as the network converges. Additionally, increased structure emerges in the later epochs as the network learns meaningful representations of the digits, and groups of neurons activating on the same digits begin to co-localize. Neurons of different layers frequently co-localize, showing that our visualization allows meaningful comparison of hidden units in different hidden layers.\n\n4.2 Comparison to other visualization methods\n\nTo evaluate the quality of the M-PHATE visualization, we compare to three established visualization methods: diffusion maps, t-SNE, and Isomap. We also compare our multislice kernel to the standard formalism of these visualization techniques, by computing pairwise distances or affinities between all units at all time points without taking into account the multislice nature of the data.\n\nFigure 2: Comparison of standard application of visualization algorithms. Each point represents a hidden unit at a given epoch during training and is colored by the epoch.\n\nFigure 2 shows the standard and multislice visualizations for all four dimensionality reduction techniques of the network in Section 4.1. For implementation details, see Section S3. 
Only the Multislice PHATE visualization reveals any meaningful evolution of the neural network over time. To quantify the quality of the visualization, we compare both interslice and intraslice neighborhoods in the embedding to the equivalent neighborhoods in the original data. Specifically, for a visualization V we define the intraslice neighborhood preservation of a point V(τ, i) ∈ V as\n\n(1/k) |N^k_{V(τ,:)}(V(τ, i)) ∩ N^k_{T(τ,:)}(T(τ, i))|\n\nand the interslice neighborhood preservation of V(τ, i) as\n\n(1/k) |N^k_{V(:,i)}(V(τ, i)) ∩ N^k_{T(:,i)}(T(τ, i))|\n\nwhere N^k_X(x) denotes the k nearest neighbors of x in X. We also calculate the Spearman correlation of the rate of change of each hidden unit with the rate of change of the validation loss, to quantify the fidelity of the visualization to the diminishing rate of convergence towards the end of training.\n\n\fTable 1: Neighborhood preservation of visualization methods applied to a FFNN classifying MNIST.\n\n                     Multislice                       Standard\n                     PHATE   DM     Isomap  t-SNE    PHATE   DM     Isomap  t-SNE\nIntraslice, k = 10   0.26    0.11   0.19    0.13     0.05    0.06   0.09    0.06\nInterslice, k = 10   0.95    0.58   0.79    0.91     0.47    0.44   0.68    0.96\nIntraslice, k = 40   0.45    0.25   0.36    0.26     0.21    0.22   0.26    0.22\nInterslice, k = 40   0.93    0.78   0.75    0.92     0.67    0.70   0.54    0.94\nLoss Correlation     0.81    0.61   0.61    0.33     0.25    0.13   0.47    -0.04\n\nM-PHATE achieves the best neighborhood preservation on all measures except the interslice neighborhood preservation, in which it performs on par with standard t-SNE. Additionally, the multislice kernel construction outperforms the corresponding standard kernel construction for all methods and all measures, except again in the case of t-SNE for interslice neighborhood preservation. 
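The intraslice score above can be sketched as follows. This is our own illustrative helper, not the authors' evaluation code; the interslice variant is analogous, with neighbors taken across epochs for a fixed unit rather than across units within an epoch:

```python
import numpy as np

def intraslice_neighborhood_preservation(V, T, k=10):
    """Average fraction of each point's k nearest neighbors within its own
    epoch that are shared between the embedding V (n_epochs, m_units, 2)
    and the time trace T (n_epochs, m_units, p_samples)."""
    def knn(X):
        D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
        np.fill_diagonal(D, np.inf)          # exclude each point itself
        return np.argsort(D, axis=1)[:, :k]
    scores = []
    for tau in range(T.shape[0]):
        for a, b in zip(knn(V[tau]), knn(T[tau])):
            scores.append(len(set(a) & set(b)) / k)
    return float(np.mean(scores))
```

A score of 1 means the embedding perfectly preserves each point's k-neighborhood within its epoch; a random embedding scores close to k divided by the number of units.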
M-PHATE also has the highest correlation with the change in loss, making it the most faithful display of network convergence.\n\n4.3 Continual learning\n\nAn ongoing challenge in artificial intelligence is making a single model perform well on many tasks independently. The capacity to succeed at dynamically changing tasks is often considered a hallmark of genuine intelligence, and is thus crucial to develop in artificial intelligence [28]. Continual learning is one attempt at achieving this goal: sequentially training a single network on different tasks, with the aim of instilling the network with new abilities as data becomes available.\n\nTo assess networks designed for continual learning tasks, a set of training baselines has been proposed. Hsu et al. [29] define three types of continual learning scenarios for classification: incremental task learning, in which a separate binary output layer is used for each task; incremental domain learning, in which a single binary output layer performs all tasks; and incremental class learning, in which a single 10-unit output layer is used, with each pair of output units used for just a single task. Further details are given in Section S4.\n\nWe implemented a 2-layer MLP with 400 units in each hidden layer to perform the incremental task, domain, and class learning scenarios using the three described baselines: standard training with Adagrad [30] and Adam [31], and an experience replay training scheme called Naive Rehearsal [29], in which a small set of training examples from each task is retained and replayed to the network during subsequent tasks. Each network was trained for 4 epochs before switching to the next task. Overall, we find that validation performance is fairly consistent with the results reported in Hsu et al. [29], with Naive Rehearsal performing best, followed by Adagrad and Adam. 
Class learning was the most challenging, followed by domain learning and task learning.\n\nFigure 3 shows M-PHATE visualizations of learning in networks trained with each of the three baselines, with network slices taken every 50 batches rather than every epoch for increased resolution. Notably, we observe a stark difference in how structure is preserved over training between networks, which is predictive of task performance. The highest-performing networks all tend to preserve representational structure across changing tasks. On the other hand, networks trained with Adam — the worst-performing combination — tend to have a structural “collapse”, or rapid change in connectivity, as the tasks switch, consistent with the rapid change (and eventual increase) in validation loss.\n\nFurther, the frequency of neighborhood changes for hidden units throughout training (appearing as a crossing of unit trajectories in the visualization) corresponds to an increase in validation loss; this is due to a change in function of the hidden units, corrupting the intended use of such units for earlier tasks.\n\n\fFigure 3: Visualization of a 2-layer MLP trained on Split MNIST for five-task continual learning of binary classification. Training loss and accuracy are reported on the current task. Validation loss and accuracy are reported on a test set consisting of an even number of samples from all tasks. Only 100 neurons are shown for clarity. Full plots are available in Section S4.\n\nWe quantify this effect by calculating the Adjusted Rand Index (ARI, Santos and Embrechts [32]) on cluster assignments computed on the subset of the visualization corresponding to the hidden units pre- and post-task switch, and find that the average ARI is strongly negatively correlated with the network's final validation loss averaged over all tasks (ρ = 0.94). 
Results are similar for the same experiment run on CIFAR10 (ρ = 0.86, see Section S4). Looking for such signatures, including rapid changes in hidden unit structure and crossing of unit trajectories, can thus be used to understand the efficiency of continual learning architectures.\n\n4.4 Generalization\n\nDespite being massively overparametrized, neural networks frequently exhibit astounding generalization performance [33, 34]. Recent work has shown that, despite having the capacity to memorize, neural networks tend to learn abstract, generalizable features rather than memorizing each example, and that this behavior under gradient descent is qualitatively different from memorization [35].\n\n\fTable 2: Adjusted Rand Index of cluster assignments computed on the subset of the PHATE visualization corresponding to the hidden units pre- and post-task switch. ARI is averaged across all four task switches, 6 different choices of clustering parameter (between 3–8 clusters) and 20 random seeds. Loss refers to average validation loss averaged over all tasks after completion of training.\n\n            Task                       Domain                     Class\n            Rehears.  Adagr.  Adam    Rehears.  Adagr.  Adam     Rehears.  Adagr.  Adam\nVal. Loss   0.047     0.042   0.709   0.104     0.462   1.062    1.884     2.904   4.156\nARI         0.741     0.716   0.719   0.772     0.768   0.740    0.632     0.614   0.466\n\nTable 3: Summed variance per epoch of the PHATE visualization is associated with the difference between a network that is memorizing and a network that is generalizing. 
Memorization error refers to the difference between train loss and validation loss.\n\n               Dropout  Kernel L1  Kernel L2  Vanilla  Activity L1  Activity L2  Random Labels  Random Pixels\nMemorization   -0.09    0.02       0.04       0.03     0.11         0.12         0.15           0.92\nVariance       382      141        46         50       0.47         0.15         0.42           0.03\n\nIn order to demonstrate the difference between networks that learn to generalize and networks that learn to memorize, we train a 3-layer MLP with 128 hidden units in each layer to classify MNIST with: no regularization; L1/L2 weight regularization; L1/L2 activity regularization; and dropout. Additionally, we train the same network to classify MNIST with random labels, as well as to classify images with randomly valued pixels, such networks being examples of pure memorization. Each network was trained for 300 epochs, and the discrepancy between train and validation loss reported.\n\nWe note that in Figure 4, the networks with the poorest generalization (i.e., those with the greatest divergence between train and validation loss), especially Activity L1 and Activity L2, display less heterogeneity in the visualization. To quantify this, we calculate the sum of the variance over all time slices of each embedding and regress this against the memorization error of each network, defined as the discrepancy between train and test loss after 300 epochs (Table 3), achieving a Spearman correlation of ρ = −0.98. Results are similar for the same experiment run on CIFAR10 (ρ = −0.97, see Section S5).\n\nFigure 4: Visualization of a 3-layer MLP trained to classify MNIST with different regularizations or manipulations applied to affect generalization performance.\n\nTo understand this phenomenon, we consider the random labels network. In order to memorize random labels, the neural network must home in on minute differences between images of the same true class in order to classify them differently. Since most images won't satisfy such specific 
Since most images won\u2019t satisfy such speci\ufb01c\n\n8\n\n\fcriteria most nodes will not respond to any given image, leading to low activation heterogeneity and\nhigh similarities between hidden units. The M-PHATE visualization clearly exposes this intuition\nvisually, depicting very little difference between these hidden units. Similar intuition can be drawn\nfrom the random pixels network, in which the difference between images is purely random. We\nhypothesize that applying L1 or L2 regularization over the activations has a qualitatively similar\neffect; reducing the variability in activations and effectively over-emphasizing small differences in\nthe hidden representation. This behavior effectively mimics the effects of memorization.\nOn the other hand, we consider the dropout network, which displays the greatest heterogeneity. Initial\nintuition evoked the idea that dropout emulates an ensemble method within a single network; by\nrandomly removing units from the network during training, the network learns to combine the output\nof many sub-networks, each of which is capable of correctly classifying the input Srivastava et al. [36].\nM-PHATE visualization of training with dropout recommends a more mechanistic version of this\nintuition: dropped-out nodes are protected from receiving the exact same gradient signals and diverge\nto a more expressive representation. The resulting heterogeneity in the network reduces the reliance\non small differences between training examples and heightens the network\u2019s capacity to generalize.\nThis intuition falls in line with other theoretical explorations, such as viewing dropout as a form of\nBayesian regularization [37] or stochastic gradient descent [38] and reinforces our understanding of\nwhy dropout induces generalization.\nWe note that while this experiment uses validation data as input to M-PHATE, we have repeated\nthis experiment in Section S2 and show equivalent results. 
In doing so, we provide a mechanism to understand the generalization performance of a network without requiring access to validation data.\n\n5 Conclusion\n\nHere we have introduced a novel approach to examining the process of learning in deep neural networks through a visualization algorithm we call M-PHATE. M-PHATE takes advantage of the dynamic nature of the hidden unit activations over the course of training to provide an interpretable visualization otherwise unattainable with standard visualizations. We demonstrate M-PHATE with two vignettes in continual learning and generalization, drawing conclusions that are not apparent without such a visualization, and providing insight into the performance of networks without necessarily requiring access to validation data. In doing so, we demonstrate the utility of such a visualization to the deep learning practitioner.\n\nAcknowledgments\n\nThis work was partially supported by the Gruber Foundation [S.G.]; the Chan-Zuckerberg Initiative (grant ID: 182702) and the National Institute of General Medical Sciences of the National Institutes of Health (grant ID: R01GM130847) [S.K.]; and the National Institute of Neurological Disorders and Stroke of the National Institutes of Health (grant ID: R01EB026936) [G.M.].\n\nReferences\n\n[1] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016. doi: 10.1109/JPROC.2015.2494218. URL https://doi.org/10.1109/JPROC.2015.2494218.\n\n[2] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2226–2234, 2016. 
URL http://papers.nips.cc/paper/6125-improved-techniques-for-training-gans.

[3] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 698–707, 2018. URL http://papers.nips.cc/paper/7350-are-gans-created-equal-a-large-scale-study.

[4] Ian J. Goodfellow and Oriol Vinyals. Qualitatively characterizing neural network optimization problems. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6544.

[5] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 6391–6401, 2018. URL http://papers.nips.cc/paper/7875-visualizing-the-loss-landscape-of-neural-nets.

[6] Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. Deep learning. Nature, 521(7553):436–444, 2015. doi: 10.1038/nature14539. URL https://doi.org/10.1038/nature14539.

[7] Yoshua Bengio, Olivier Delalleau, and Nicolas Le Roux. The curse of highly variable functions for local kernel machines. In Advances in Neural Information Processing Systems 18 [Neural Information Processing Systems, NIPS 2005, December 5-8, 2005, Vancouver, British Columbia, Canada], pages 107–114, 2005. URL http://papers.nips.cc/paper/2810-the-curse-of-highly-variable-functions-for-local-kernel-machines.

[8] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009. doi: 10.1561/2200000006.
URL https://doi.org/10.1561/2200000006.

[9] Guido F. Montúfar and Jason Morton. When does a mixture of products contain a product of mixtures? SIAM J. Discrete Math., 29(1):321–347, 2015. doi: 10.1137/140957081. URL https://doi.org/10.1137/140957081.

[10] Guido F. Montúfar, Razvan Pascanu, KyungHyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2924–2932, 2014. URL http://papers.nips.cc/paper/5422-on-the-number-of-linear-regions-of-deep-neural-networks.

[11] Trevor F Cox and Michael AA Cox. Multidimensional Scaling. Chapman and Hall/CRC, 2000.

[12] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[13] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[14] Kevin R Moon, David van Dijk, Zheng Wang, Scott Gigante, Daniel Burkhardt, William Chen, Antonia van den Elzen, Matthew J Hirn, Ronald R Coifman, Natalia B Ivanova, Guy Wolf, and Smita Krishnaswamy. Visualizing transitions and structure for high dimensional data exploration. bioRxiv, page 120378, 2017.

[15] Ronald R Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.

[16] J. He, L. Zhang, Q. Wang, and Z. Li. Using diffusion geometric coordinates for hyperspectral imagery representation. IEEE Geosci. Remote Sens. Letters, 6(4):767–771, Oct. 2009.

[17] Zeev Farbman, Raanan Fattal, and Dani Lischinski. Diffusion maps for edge-aware image editing. ACM Trans. Graph., 29(6):145:1–145:10, Dec. 2010.

[18] Ronen Talmon, Israel Cohen, and Sharon Gannot.
Single-channel transient interference suppression with diffusion maps. IEEE Trans. Audio, Speech Lang. Process., 21(1):130–142, Apr. 2012.

[19] Gal Mishne and Israel Cohen. Multiscale anomaly detection using diffusion maps. IEEE J. Sel. Topics Signal Process., 7:111–123, Feb. 2013.

[20] Ronald R Coifman and Matthew J Hirn. Diffusion maps for changing data. Applied and Computational Harmonic Analysis, 36(1):79–107, 2014.

[21] Gal Mishne, Ronen Talmon, Ron Meir, Jackie Schiller, Maria Lavzin, Uri Dubin, and Ronald R. Coifman. Hierarchical coupled-geometry analysis for neuronal structure and activity pattern discovery. IEEE Journal of Selected Topics in Signal Processing, 10(7):1238–1253, Oct 2016. ISSN 1932-4553. doi: 10.1109/JSTSP.2016.2602061.

[22] Ralf Banisch and Péter Koltai. Understanding the geometry of transport: Diffusion maps for Lagrangian trajectory data unravel coherent sets. Chaos: An Interdisciplinary Journal of Nonlinear Science, 27(3):035804, 2017. doi: 10.1063/1.4971788. URL https://doi.org/10.1063/1.4971788.

[23] Ofir Lindenbaum, Arie Yeredor, Moshe Salhov, and Amir Averbuch. Multi-view diffusion maps. Information Fusion, 2019.

[24] Roy R Lederman and Ronen Talmon. Learning the geometry of common latent variables using alternating-diffusion. Applied and Computational Harmonic Analysis, 44(3):509–536, 2018.

[25] Nicholas F Marshall and Matthew J Hirn. Time coupled diffusion maps. Applied and Computational Harmonic Analysis, 45(3):709–728, 2018.

[26] Peter J Mucha, Thomas Richardson, Kevin Macon, Mason A Porter, and Jukka-Pekka Onnela. Community structure in time-dependent, multiscale, and multiplex networks. Science, 328(5980):876–878, 2010.

[27] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, 1998.

[28] German Ignacio Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019. doi: 10.1016/j.neunet.2019.01.012. URL https://doi.org/10.1016/j.neunet.2019.01.012.

[29] Yen-Chang Hsu, Yen-Cheng Liu, and Zsolt Kira. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. CoRR, abs/1810.12488, 2018. URL http://arxiv.org/abs/1810.12488.

[30] John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011. URL http://dl.acm.org/citation.cfm?id=2021068.

[31] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.

[32] Jorge M Santos and Mark Embrechts. On the use of the adjusted Rand index as a metric for evaluating supervised classification. In International Conference on Artificial Neural Networks, pages 175–184. Springer, 2009.

[33] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=Sy8gdB9xx.

[34] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. CoRR, abs/1811.04918, 2018. URL http://arxiv.org/abs/1811.04918.

[35] Devansh Arpit, Stanislaw K. Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S.
Kanwal, Tegan Maharaj, Asja Fischer, Aaron C. Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 233–242, 2017. URL http://proceedings.mlr.press/v70/arpit17a.html.

[36] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014. URL http://dl.acm.org/citation.cfm?id=2670313.

[37] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.

[38] Pierre Baldi and Peter J Sadowski. Understanding dropout. In Advances in Neural Information Processing Systems, pages 2814–2822, 2013.