{"title": "Analysis of Unstandardized Contributions in Cross Connected Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 601, "page_last": 608, "abstract": null, "full_text": "Analysis of Unstandardized Contributions \n\nin Cross Connected Networks \n\nThomas R. Shultz \nshultz@psych.mcgill.ca \n\nYuriko Oshima-Takane \n\nyuriko@psych.mcgill.ca \n\nYoshio Takane \n\ntakane@psych.mcgill.ca \n\nDepartment of Psychology \n\nMcGill University \n\nMontreal, Quebec, Canada H3A IBI \n\nAbstract \n\nUnderstanding knowledge representations in neural nets has been a \ndifficult problem. Principal components analysis (PCA) of \ncontributions (products of sending activations and connection weights) \nhas yielded valuable insights into knowledge representations, but much \nof this work has focused on the correlation matrix of contributions. The \npresent work shows that analyzing the variance-covariance matrix of \ncontributions yields more valid insights by taking account of weights. \n\n1 INTRODUCTION \nThe knowledge representations learned by neural networks are usually difficult to \nunderstand because of the non-linear properties of these nets and the fact that knowledge is \noften distributed across many units. Standard network analysis techniques, based on a \nnetwork's connection weights or on its hidden unit activations, have been limited. Weight \ndiagrams are typically complex and weights vary across mUltiple networks trained on the \nsame problem. Analysis of activation patterns on hidden units is limited to nets with a \nsingle layer of hidden units without cross connections. \nCross connections are direct connections that bypass intervening hidden unit layers. They \nincrease learning speed in static networks by focusing on linear relations (Lang & \nWitbrock, 1988) and are a standard feature of generative algorithms such as cascade(cid:173)\ncorrelation (Fahlman & Lebiere, 1990). 
Because such cross connections do so much of the work, analyses that are restricted to hidden unit activations furnish only a partial picture of the network's knowledge. \nContribution analysis has been shown to be a useful technique for multi-layer, cross connected nets. Sanger (1989) defined a contribution as the product of an output weight, the activation of a sending unit, and the sign of the output target for that input. Such contributions are potentially more informative than either weights alone or hidden unit activations alone since they take account of both weight and sending activation. Shultz and Elman (1994) used PCA to reduce the dimensionality of such contributions in several different types of cascade-correlation nets. Shultz and Oshima-Takane (1994) demonstrated that PCA of unscaled contributions produced even better insights into cascade-correlation solutions than did comparable analyses of contributions scaled by the sign of output targets. Sanger (1989) had recommended scaling contributions by the signs of output targets in order to determine whether the contributions helped or hindered the network's solution. But since the signs of output targets are only available to networks during error correction learning, it is more natural to use unscaled contributions in analyzing knowledge representations. \nThere is an issue in PCA about whether to use the correlation matrix or the variance-covariance matrix. The correlation matrix contains 1s on the diagonal and Pearson correlation coefficients between contributions off the diagonal. This has the effect of standardizing the variables (contributions) so that each has a mean of 0 and standard deviation of 1. 
Effectively, this ensures that the PCA of a correlation matrix exploits variation in input activation patterns but ignores variation in connection weights (because variation in connection weights is eliminated as the contributions are standardized). \nHere, we report on work that investigates whether more useful insights into network knowledge structures can be revealed by PCA of unstandardized contributions. To do this, we apply PCA to the variance-covariance matrix of contributions. The variance-covariance matrix has contribution variances along the diagonal and covariances between contributions off the diagonal. Taking explicit account of the variation in connection weights in this way may produce a more valid picture of the network's knowledge. \nWe use some of the same networks and problems employed in our earlier work (Shultz & Elman, 1994; Shultz & Oshima-Takane, 1994) to facilitate comparison of results. The problems include continuous XOR, arithmetic comparisons involving addition and multiplication, and distinguishing between two interlocking spirals. All of the nets were generated with the cascade-correlation algorithm (Fahlman & Lebiere, 1990). \nCascade-correlation begins as a perceptron and recruits hidden units into the network as it needs them in order to reduce error. The recruited hidden unit is the one whose activations correlate best with the network's current error. Recruited units are installed in a cascade, each on a separate layer and receiving input from the input units and from any previously existing hidden units. We used the default values for all cascade-correlation parameters. \nThe goal of understanding knowledge representations learned by networks ought to be useful in a variety of contexts. One such context is cognitive modeling, where the ability of nets to merely simulate psychological phenomena is not sufficient (McCloskey, 1991). 
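The pipeline just described reduces to a few lines of linear algebra. The following is a minimal NumPy sketch of our own (not the authors' code), using a made-up random single-output net; `contributions` holds one row per input pattern and one column per incoming output weight, and the PCA is taken over the variance-covariance matrix so weight magnitudes are retained:

```python
import numpy as np

# Hypothetical sketch of contribution analysis, not the authors' code.
# A contribution is the product of a sending unit's activation and its
# weight into the output: one row per input pattern, one column per weight.
rng = np.random.default_rng(0)

n_patterns, n_senders = 100, 6                 # e.g., inputs + recruited hidden units
activations = rng.uniform(0, 1, (n_patterns, n_senders))
output_weights = rng.normal(0, 2, n_senders)   # single output unit assumed

contributions = activations * output_weights   # broadcast: pattern x weight

# PCA of the variance-covariance matrix (not the correlation matrix),
# so that differences in weight magnitude are retained.
cov = np.cov(contributions, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]              # largest-variance component first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()            # proportion of variance per component
scores = (contributions - contributions.mean(0)) @ eigvecs  # component scores
```

A scree plot of `explained` would then suggest how many components to keep, and plots of `scores` would be inspected as in the figures below.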
In addition, it is important to determine whether the network representations bear any systematic relation to the representations employed by human subjects. \n\n2 PCA OF CONTRIBUTIONS \nSanger's (1989) original contribution analysis began with a three-dimensional array of contributions (output unit x hidden unit x input pattern). In contrast, we start with a two-dimensional output weight x input pattern array of contributions. This is more efficient than the slicing technique used by Sanger to focus on particular output or hidden units and still allows identification of the roles of specific contributions (Shultz & Elman, 1994; Shultz & Oshima-Takane, 1994). \nWe subject the variance-covariance matrix of contributions to PCA in order to identify the main dimensions of variation in the contributions (Jolliffe, 1986). A component is a line of best fit to a set of data points in multi-dimensional space. The goal of PCA is to summarize a multivariate data set with a relatively small number of components by capitalizing on covariance among the variables (in this case, contributions). \nWe use the scree test (Cattell, 1966) to determine how many components are useful to include in the analysis. Varimax rotation is applied to improve the interpretability of the solution. Component scores are plotted to identify the function of each component. \n\n3 APPLICATION TO CONTINUOUS XOR \nThe classical binary XOR problem does not have enough training patterns to make contribution analysis worthwhile. However, we constructed a continuous version of the XOR problem by dividing the input space into four quadrants. Starting from 0.1, input values were incremented in steps of 0.1, producing 100 x, y input pairs that can be partitioned into four quadrants of the input space. Quadrant a had values of x less than 0.55 combined with values of y above 0.55. 
Quadrant b had values of x and y greater than 0.55. Quadrant c had values of x and y less than 0.55. Quadrant d had values of x greater than 0.55 combined with values of y below 0.55. Similar to binary XOR, problems from quadrants a and d had a positive output target (0.5) for the net, whereas problems from quadrants b and c had a negative output target (-0.5). There was a single output unit with a sigmoid activation. \nThree cascade-correlation nets were trained on continuous XOR. Each of these nets generated a unique solution, recruiting five or six hidden units and taking from 541 to 765 epochs to learn to correctly classify all of the input patterns. Generalization to test patterns not in the training set was excellent. PCA of unscaled, unstandardized contributions yielded three components. A plot of rotated component scores for the 100 training patterns of net 1 is shown in Figure 1. The component scores are labeled according to their respective quadrant in the input space. Three components are required to account for 96.0% of the variance in the contributions. \nFigure 1 shows that component 1, with 44.3% of the variance in contributions, has the role of distinguishing those quadrants with a positive output target (a and d) from those with a negative output target (b and c). This is indicated by the fact that the black shapes are at the top of the component space cube in Figure 1 and the white shapes are at the bottom. Components 2 and 3 represent variation along the x and y input dimensions, respectively. Component 2 accounted for 26.1% of the variance in contributions, and component 3 accounted for 25.6% of the variance in contributions. Input pairs from quadrants b and d (square shapes) are concentrated on the negative end of component 2, whereas input pairs from quadrants a and c (circle shapes) are concentrated on the positive end of component 2. 
Similarly, input pairs from quadrants a and b cluster on the negative end of component 3, and input pairs from quadrants c and d cluster on the positive end of component 3. Although the network was not explicitly trained to represent the x and y input dimensions, it did so as an incidental feature of its learning the distinction between quadrants a and d vs. quadrants b and c. Similar results were obtained from the other two nets learning the continuous XOR problem. \nIn contrast, PCA of the correlation matrix from these nets had yielded a somewhat less clear picture, with the third component separating quadrants a and d from quadrants b and c, and the first two components representing variation along the x and y input dimensions (Shultz & Oshima-Takane, 1994). PCA of the correlation matrix of scaled contributions had performed even worse, with plots of component scores indicating interactive separation of the four quadrants, but with no clear roles for the individual components (Shultz & Elman, 1994). \nStandardized, rotated component loadings for net 1 are plotted in Figure 2. Such plots can be examined to determine the role played by each contribution in the network. For example, hidden units 2, 3, and 4 all play a major role in the job done by component 1, distinguishing positive from negative outputs. \n\n4 APPLICATION TO COMPARATIVE ARITHMETIC \nArithmetic comparison requires a net to conclude whether a sum or a product of two integers is greater than, less than, or equal to a comparison integer. Several psychological simulations have used neural nets to make additive and multiplicative comparisons and this has enhanced interest in this type of problem (McClelland, 1989; Shultz, Schmidt, Buckingham, & Mareschal, in press). \nThe first input unit coded the type of arithmetic operation to be performed: 0 for addition and 1 for multiplication. Three additional linear input units encoded the integers. 
Two of these input units each coded a randomly selected integer in the range of 0 to 9, inclusive; another input unit coded a randomly selected comparison integer. For addition problems, comparison integers ranged from 0 to 19, inclusive; for multiplication, comparison integers ranged from 0 to 82, inclusive. Two sigmoid output units coded the results of the comparison operation. Target outputs of 0.5, -0.5 represented a greater than result, targets of -0.5, 0.5 represented less than, and targets of 0.5, 0.5 represented equal to. \n\nFigure 1. Rotated component scores for a continuous XOR net. Component scores for the x, y input pairs in quadrant a are labeled with black circles, those from quadrant b with white squares, those from quadrant c with white circles, and those from quadrant d with black squares. The network's task is to distinguish pairs from quadrants a and d (the black shapes) from pairs from quadrants b and c (the white shapes). Some of the white shapes appear black because they are so densely packed, but all of the truly black shapes are relatively high in the cube. \n\nFigure 2. Standardized, rotated component loadings for a continuous XOR net. Rotated loadings were standardized by dividing them by the standard deviation of the respective contribution scores. 
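For reference, the continuous XOR training set analyzed in Figures 1 and 2 is easy to reconstruct. The sketch below follows the description in Section 3 (the grid, quadrant labels, and targets come from the text; the variable names and helper function are ours):

```python
import numpy as np

# Continuous XOR training set as described in Section 3:
# 100 (x, y) pairs on a 0.1-spaced grid, split into four quadrants at 0.55.
vals = np.arange(1, 11) / 10.0                  # 0.1, 0.2, ..., 1.0
xs, ys = np.meshgrid(vals, vals)
x, y = xs.ravel(), ys.ravel()                   # 100 input pairs

def quadrant(xi, yi):
    """Label an input pair by its quadrant of the input space."""
    if xi < 0.55 and yi > 0.55:
        return "a"
    if xi > 0.55 and yi > 0.55:
        return "b"
    if xi < 0.55 and yi < 0.55:
        return "c"
    return "d"                                  # x > 0.55, y < 0.55

quads = np.array([quadrant(xi, yi) for xi, yi in zip(x, y)])
# Quadrants a and d take a positive target (0.5); b and c a negative one (-0.5).
targets = np.where(np.isin(quads, ["a", "d"]), 0.5, -0.5)
```

Each quadrant contains 25 of the 100 pairs, which is why the component-score plots show four similarly sized clusters.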
The training patterns had 100 addition and 100 multiplication problems, randomly selected, with the restriction that 45 of each had correct answers of greater than, 45 of each had correct answers of less than, and 10 of each had correct answers of equal to. These constraints were designed to reduce the natural skew of comparative values in the high direction on multiplication problems. \nWe ran three nets for 1000 epochs each, at which point they were very close to mastering the training patterns. Either seven or eight hidden units were recruited along the way. Generalization to previously unseen test problems was very accurate. Four components were sufficient to account for most of the variance in unstandardized contributions, 88.9% in the case of net 1. \nFigure 3 displays the rotated component scores for the first two components of net 1. Component 1, accounting for 51.1% of the variance, separated problems with greater than answers from problems with less than answers, and located problems with equal to answers in the middle, at least for addition problems. Component 2, with 20.2% of the variance, clearly separated multiplication from addition. Contributions from the first input unit were strongly associated with component 2. Similar results were obtained for the other two nets. \nComponents 3 and 4, with 10.6% and 7.0% of the variance, were sensitive to variation in the second and third inputs, respectively. This is supported by an examination of the mean input values of the 20 most extreme component scores on these two components. Recall that the second and third inputs coded the two integers to be added or multiplied. The negative end of component 3 had a mean second input value of 8.25; the positive end of this component had a mean second input value of 0.55. 
Component 4 had a mean third input value of 2.00 on the negative end and 7.55 on the positive end. \nIn contrast, PCA of the correlation matrix for these nets had yielded a far more clouded picture, with the largest components focusing on input variation and lesser components doing bits and pieces of the separation of answer types and operations in an interactive manner (Shultz & Oshima-Takane, 1994). Problems with equal to answers were not isolated by any of the components. PCA of scaled contributions had produced three components that interactively separated the three answer types and operations, but failed to represent variation in input integers (Shultz & Elman, 1994). Essentially similar advantages for using the variance-covariance matrix were found for nets learning either addition alone or multiplication alone. \n\n5 APPLICATION TO THE TWO-SPIRALS PROBLEM \nThe two-spirals problem requires a particularly difficult discrimination and a large number of hidden units. The input space is defined by two interlocking spirals that wrap around their origin three times. There are two sets of 97 real-valued x, y pairs, with each set representing one of the spirals, and a single sigmoid output unit coded the identity of the spiral. Our three nets took between 1313 and 1723 epochs to master the distinction, and recruited from 12 to 16 hidden units. All three nets generalized well to previously unseen input pairs on the paths of the two spirals. \nPCA of the variance-covariance matrix for net 1 revealed that six components accounted for a total of 97.9% of the variance in contributions. The second and fourth of these components together distinguished one spiral from the other, with 20.7% and 9.8% of the variance respectively. Rotated component scores for these two components are plotted in Figure 4. 
A diagonal line drawn on Figure 4 from coordinates -2, 2 to 2, -2 indicates that 11 points from each spiral were misclassified by components 2 and 4. This is only 11.3% of the data points in the training patterns. The fact that the net learned all of the training patterns implies that these exceptions were picked up by other components. \nComponents 1 and 6, with 40.7% and 6.4% of the variance, were sensitive to variation in the x and y inputs, respectively. Again, this was confirmed by the mean input values of the 20 most extreme component scores on these two components. On component 1, the negative end had a mean x value of 3.55 and the positive end had a mean x value of -3.55. \n\nFigure 3. Rotated component scores for an arithmetic comparison net. Greater than problems are symbolized by circles, less than problems by squares, addition by white shapes, and multiplication by black shapes. For equal to problems only, addition is represented by + and multiplication by X. Although some densely packed white shapes may appear black, they have no overlap with truly black shapes. All of the black squares are concentrated around coordinates -1, -1. \n\nFigure 4. Rotated component scores for a two-spirals net. Squares represent data points from spiral 1, and circles represent data points from spiral 2. \n\nOn component 6, the negative end had a mean y value of 2.75 and the positive end had a mean y value of -2.75. 
The skew-symmetry of these means is indicative of the perfectly symmetrical representations that cascade-correlation nets achieve on this highly symmetrical problem. Every data point on every component has a mirror-image negative with the opposite-signed component score on that same component. This -x, -y mirror-image point is always on the other spiral. Other components concentrated on particular regions of the spirals. The other two nets yielded essentially similar results. \nThese results can be contrasted with our previous analyses of the two-spirals problem, none of which succeeded in showing a clear separation of the two spirals. PCAs based on scaled (Shultz & Elman, 1994) or unscaled (Shultz & Oshima-Takane, 1994) correlation matrices showed extensive symmetries but never a distinction between one spiral and another.1 Thus, although it was clear that the nets had encoded the problem's inherent symmetries, it was still unclear from previous work how the nets used this or other information to distinguish points on one spiral from points on the other spiral. \n\n6 DISCUSSION \nOn each of these problems, there was considerable variation among network solutions, as revealed, for example, by variation in numbers of hidden units recruited and signs and sizes of connection weights. In spite of such variation, the present technique of applying PCA to the variance-covariance matrix of contributions yielded results that are sufficiently abstract to characterize different nets learning the same problem. The knowledge representations produced by this analysis clearly identify the essential information that the net is being trained to utilize as well as more incidental features of the training patterns such as the nature of the input space. 
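The difference between the two matrices at the heart of this comparison can be seen in a toy example of our own (not from the paper): two contribution columns that differ only in the scale of their weight are indistinguishable in the correlation matrix but clearly distinguished in the variance-covariance matrix.

```python
import numpy as np

# Toy illustration (ours, not the authors' data): scaling one contribution by
# a large weight changes the variance-covariance matrix but not the
# correlation matrix, which standardizes each column away.
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 2))          # 200 patterns, 2 contributions
small = base.copy()
large = base.copy()
large[:, 0] *= 10.0                       # simulate a 10x larger output weight

corr_small = np.corrcoef(small, rowvar=False)
corr_large = np.corrcoef(large, rowvar=False)
cov_small = np.cov(small, rowvar=False)
cov_large = np.cov(large, rowvar=False)

same_corr = np.allclose(corr_small, corr_large)   # correlation blind to scale
diff_cov = not np.allclose(cov_small, cov_large)  # covariance retains scale
```

Here `same_corr` is true and `diff_cov` is true: the first contribution's variance grows a hundredfold while its correlations are unchanged, which is exactly the weight information the correlation-matrix analyses discard.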
\nThis research strengthens earlier conclusions that PCA of network contributions is a useful technique for understanding network performance (Sanger, 1989), including relatively intractable multi-level cross connected nets (Shultz & Elman, 1994; Shultz & Oshima-Takane, 1994). However, the current study underscores the point that there are several ways to prepare a contribution matrix for PCA, not all of which yield equally valid or useful results. Rather than starting with a three-dimensional matrix of output unit x hidden unit x input pattern and focusing on either one output unit at a time or one hidden unit at a time (Sanger, 1989), it is preferable to collapse contributions into a two-dimensional matrix of output weight x input pattern. The latter is not only more efficient, but yields more valid results that characterize the network as a whole, rather than small parts of the network. \nAlso, rather than scaling contributions by the sign of the output target (Sanger, 1989), it is better to use unscaled contributions. Unscaled contributions are not only more realistic, since the network has no knowledge of output targets during its feed-forward phase, but also produce clearer interpretations of the net's knowledge representations (Shultz & Oshima-Takane, 1994). The latter claim is particularly true in terms of sensitivity to input dimensions and to operational distinctions between adding and multiplying. Plots of component scores based on unscaled contributions are typically not as dense as those based on scaled contributions but are more revealing of the network's knowledge. \nFinally, rather than applying PCA to the correlation matrix of contributions, it makes more sense to apply it to the variance-covariance matrix. As noted in the introduction, using the correlation matrix effectively standardizes the contributions to have identical means and variances, thus obscuring the role of network connection weights. 
The present results indicate much clearer knowledge representations when the variance-covariance matrix is used, since connection weight information is explicitly retained. Matrix differences were especially marked on the more difficult problems, such as two-spirals, where the only PCAs to reveal how nets distinguished the spirals were those based on variance-covariance matrices. But the relative advantages of using the variance-covariance matrix were evident on the easier problems too. \n\n1 Results from unscaled contributions on the two-spirals problem were not actually presented in Shultz & Oshima-Takane (1994) since they were not very clear. \n\nThere has been recent rapid progress in the study of the knowledge representations learned by neural nets. Feed-forward nets can be viewed as function approximators for relating inputs to outputs. Analysis of their knowledge representations should reveal how inputs are encoded and transformed to produce the correct outputs. PCA of network contributions sheds light on how these function approximations are done. Components emerging from PCA are orthonormalized ingredients of the transformations of inputs that produce the correct outputs. Thus, PCA helps to identify the nature of the required transformations. Further progress might be expected from combining PCA with other matrix decomposition techniques. Constrained PCA uses external information to decompose multivariate data matrices before applying PCA (Takane & Shibayama, 1991). \nAnalysis techniques emerging from this research will be useful in understanding and applying neural net research. Component loadings, for example, could be used to predict the results of lesioning experiments with neural nets. 
Once the role of a hidden unit has been identified by virtue of its association with a particular component, then one could predict that lesioning this unit would impair the function served by the component. \n\nAcknowledgments \nThis research was supported by the Natural Sciences and Engineering Research Council of Canada. \n\nReferences \nCattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245-276. \nFahlman, S. E., & Lebiere, C. (1990). The cascade-correlation learning architecture. In D. Touretzky (Ed.), Advances in neural information processing systems 2 (pp. 524-532). Mountain View, CA: Morgan Kaufmann. \nJolliffe, I. T. (1986). Principal component analysis. Berlin: Springer-Verlag. \nLang, K. J., & Witbrock, M. J. (1988). Learning to tell two spirals apart. In D. Touretzky, G. Hinton, & T. Sejnowski (Eds.), Proceedings of the Connectionist Models Summer School (pp. 52-59). Mountain View, CA: Morgan Kaufmann. \nMcClelland, J. L. (1989). Parallel distributed processing: Implications for cognition and development. In R. G. M. Morris (Ed.), Parallel distributed processing: Implications for psychology and neurobiology (pp. 8-45). Oxford University Press. \nMcCloskey, M. (1991). Networks and theories: The place of connectionism in cognitive science. Psychological Science, 2, 387-395. \nSanger, D. (1989). Contribution analysis: A technique for assigning responsibilities to hidden units in connectionist networks. Connection Science, 1, 115-138. \nShultz, T. R., & Elman, J. L. (1994). Analyzing cross connected networks. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in Neural Information Processing Systems 6. San Francisco, CA: Morgan Kaufmann. \nShultz, T. R., & Oshima-Takane, Y. (1994). Analysis of unscaled contributions in cross connected networks. In Proceedings of the World Congress on Neural Networks (Vol. 3, pp. 690-695). 
Hillsdale, NJ: Lawrence Erlbaum. \nShultz, T. R., Schmidt, W. C., Buckingham, D., & Mareschal, D. (in press). Modeling cognitive development with a generative connectionist algorithm. In G. Halford & T. Simon (Eds.), Developing cognitive competence: New approaches to process modeling. Hillsdale, NJ: Erlbaum. \nTakane, Y., & Shibayama, T. (1991). Principal component analysis with external information on both subjects and variables. Psychometrika, 56, 97-120.", "award": [], "sourceid": 881, "authors": [{"given_name": "Thomas", "family_name": "Shultz", "institution": null}, {"given_name": "Yuriko", "family_name": "Oshima-Takane", "institution": null}, {"given_name": "Yoshio", "family_name": "Takane", "institution": null}]}