{"title": "Higher Order Statistical Decorrelation without Information Loss", "book": "Advances in Neural Information Processing Systems", "page_first": 247, "page_last": 254, "abstract": null, "full_text": "Higher Order Statistical Decorrelation without \n\nInformation Loss \n\nGustavo Deco \nSiemensAG \n\nCentral Research \nOtto-Hahn-Ring 6 \n81739 Munich \n\nGeIIDany \n\nWilfried Brauer \n\nTechnische UniversiUit MUnchen \n\nInstitut fur InfoIIDatik \n\nArcisstr. 21 \n80290 Munich \n\nGeIIDany \n\nAbstract \n\nA neural network learning paradigm based on information theory is pro(cid:173)\nposed as a way to perform in an unsupervised fashion, redundancy \nreduction among the elements of the output layer without loss of infor(cid:173)\nmation from the sensory input. The model developed performs nonlin(cid:173)\near decorrelation up to higher orders of the cumulant tensors and results \nin probabilistic ally independent components of the output layer. This \nmeans that we don't need to assume Gaussian distribution neither at the \ninput nor at the output. The theory presented is related to the unsuper(cid:173)\nvised-learning theory of Barlow, which proposes redundancy reduction \nas the goal of cognition. When nonlinear units are used nonlinear princi(cid:173)\npal component analysis is obtained. In this case nonlinear manifolds can \nbe reduced to minimum dimension manifolds. If such units are used the \nnetwork performs a generalized principal component analysis in the \nsense that non-Gaussian distributions can be linearly decorrelated and \nhigher orders of the correlation tensors are also taken into account. The \nbasic structure of the architecture involves a general transfOlmation that \nis volume conserving and therefore the entropy, yielding a map without \nloss of infoIIDation. 
Minimization of the mutual information among the output neurons eliminates the redundancy between the outputs and results in statistical decorrelation of the extracted features. This is known as factorial learning. \n\n1 INTRODUCTION \n\nOne of the most important theories of feature extraction is the one proposed by Barlow (1989). Barlow describes the process of cognition as a preprocessing of the sensory information performed by the nervous system in order to extract the statistically relevant and independent features of the inputs without losing information. This means that the brain should statistically decorrelate the extracted information. As a learning strategy, Barlow (1989) formulated the principle of redundancy reduction. This kind of learning is called factorial learning. Recently Atick and Redlich (1992) and Redlich (1993) concentrated on the original idea of Barlow, yielding a very interesting formulation of early visual processing and factorial learning. Redlich (1993) reduces redundancy at the input by using a network structure which is a reversible cellular automaton and therefore guarantees the conservation of information in the transformation between input and output. Some nonlinear extensions of PCA for decorrelation of sensory input signals were recently introduced. These follow very closely Barlow's original ideas of unsupervised learning. Redlich (1993) uses similar information-theoretic concepts and reversible cellular automaton architectures in order to define how nonlinear decorrelation can be performed. The aim of our work is to formulate a neural network architecture and a novel learning paradigm that perform Barlow's unsupervised learning in the most general fashion. The basic idea is to define an architecture that assures perfect transmission without loss of information. 
Consequently the nonlinear transformation defined by the neural architecture is always bijective. The architecture performs a volume-conserving transformation (the determinant of the Jacobian matrix is equal to one). As a particular case we can derive the reversible cellular automaton architecture proposed by Redlich (1993). The learning paradigm is defined so that the components of the output signal are statistically decorrelated. Due to the fact that the output distribution is not necessarily Gaussian, even if the input is Gaussian, we perform a cumulant expansion of the output distribution and find the rules that should be satisfied by the higher order correlation tensors in order to be decorrelated. \n\n2 THEORETICAL FORMALISM \n\nLet us consider an input vector x of dimensionality d with components distributed according to the probability distribution P(x), which is not factorial, i.e. the components of x are correlated. The goal of Barlow's unsupervised learning rule is to find a transformation \n\ny = F(x)   (2.1) \n\nsuch that the components of the d-dimensional output vector y are statistically decorrelated. This means that the probability distributions of the components y_j are independent and therefore \n\nP(y) = Π_j P(y_j).   (2.2) \n\nThe objective of factorial learning is to find a neural network which performs the transformation F(·) such that the joint probability distribution P(y) of the output signals is factorized as in eq. (2.2). In order to implement factorial learning, the information contained in the input should be transferred to the output neurons without loss, but the probability distribution of the output neurons should be statistically decorrelated. Let us now define these facts from the information-theory perspective. 
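The volume-conserving construction described above is easy to check numerically. The sketch below is our own minimal illustration (the particular functions tanh(2x₁) and x₁x₂² are arbitrary stand-ins, not the paper's f_i): any map in which component i adds a function of only the preceding components has a lower-triangular Jacobian with ones on the diagonal, so det J = 1 and the transformation is bijective and entropy-conserving.

```python
import numpy as np

def volume_conserving_map(x):
    """Triangular map: each output adds a nonlinear function of the
    *preceding* components only, so the Jacobian is lower-triangular
    with ones on the diagonal and its determinant is exactly 1."""
    y = x.copy()
    y[1] = x[1] + np.tanh(2.0 * x[0])   # depends only on x[0]
    y[2] = x[2] + x[0] * x[1] ** 2      # depends only on x[0], x[1]
    return y

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f at x."""
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        dx = np.zeros(n)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

x = np.array([0.3, -1.2, 0.7])
J = numerical_jacobian(volume_conserving_map, x)
print(np.linalg.det(J))  # ~1.0 regardless of the chosen functions
```

Inverting such a map is equally mechanical (solve for the components in order), which is what makes the transformation lossless.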
The first aspect is to assure that the entropy is conserved, i.e. \n\nH(x) = H(y)   (2.3) \n\nwhere the symbol H(a) denotes the entropy of a and H(a/b) the conditional entropy of a given b. One way to achieve this goal is to construct an architecture that satisfies eq. (2.3) independently of its synaptic parameters. Thus the architecture will conserve information, or entropy. The transmitted entropy satisfies \n\nH(y) ≤ H(x) + ∫ P(x) ln|det(∂F/∂x)| dx   (2.4) \n\nwhere equality holds only if F is bijective, i.e. reversible. Conservation of information and bijectivity are assured if the neural transformation conserves volume, which mathematically can be expressed by the fact that the Jacobian of the transformation should have determinant unity. In section 3 we formulate an architecture that always conserves the entropy. Let us now concentrate on the main aspect of factorial learning, namely the decorrelation of the output components. Here the problem is to find a volume-conserving transformation that satisfies eq. (2.2). The major problem is that the distribution of the output signal will not necessarily be Gaussian. Therefore it is impossible to use the technique of minimizing the mutual information between the components of the output as done by Redlich (1993). The only way to decorrelate non-Gaussian distributions is to expand the distribution in higher orders of the correlation matrix and impose the independence condition of eq. (2.2). In order to achieve this we propose to use a cumulant expansion of the output distribution. Let us define the Fourier transform of the output distribution, \n\nΦ(κ) = ∫ dy e^{i(κ·y)} P(y),   Φ(κ_i) = ∫ dy_i e^{i κ_i y_i} P(y_i).   (2.5) \n\nThe cumulant expansion of a distribution is (Papoulis, 1991) \n\nln Φ(κ) = Σ_{n≥1} (i^n / n!) Σ_{i_1…i_n} c_{i_1…i_n} κ_{i_1} ⋯ κ_{i_n}.   (2.6) \n\nIn the Fourier space the independence condition is given by (Papoulis, 1991) \n\nΦ(κ) = Π_i Φ(κ_i)   (2.7) \n\nwhich is equivalent to \n\nln Φ(κ) = ln Π_i Φ(κ_i) = Σ_i ln Φ(κ_i).   (2.8) \n\nPutting eq. (2.8) and the cumulant expansions of eq. (2.6) together, we obtain that in the case of independence the following equality is satisfied \n\nΣ_n (i^n / n!) Σ_{i_1…i_n} c_{i_1…i_n} κ_{i_1} ⋯ κ_{i_n} = Σ_i Σ_n (i^n / n!) c_n^{(i)} κ_i^n,   (2.9) \n\ni.e. all cumulants with mixed indices must vanish. In both expansions we will only consider the first four cumulants. After an extra transformation \n\ny′ = y − ⟨y⟩   (2.10) \n\nto remove the bias, with …   (3.3) \n\nIn order to calculate the derivative of the cost functions we need \n\n∂C_{i…j}/∂θ = (1/N) Σ_p { ∂(y_i − ȳ_i)/∂θ ⋯ (y_j − ȳ_j) + … + (y_i − ȳ_i) ⋯ ∂(y_j − ȳ_j)/∂θ },   ∂ȳ_i/∂θ = (1/N) Σ_p ∂y_i/∂θ   (3.4) \n\nwhere θ represents the parameters θ_j and w_i. The sums in both equations extend over the N training patterns. The gradients of the different outputs are \n\n∂y_k/∂θ_i = ∂g_k/∂θ_i,   ∂y_k/∂w_i = (∂_h g_k)(∂f_i/∂w_i) δ_{i>k} + ∂g_k/∂w_i   (3.5) \n\nwhere δ_{i>k} is equal to 1 if i > k and 0 otherwise. In this paper we choose a polynomial form for the functions f and g. This model involves higher order neurons: in this case each function f_i or g_i is a product of polynomial functions of the inputs. The update equations are given by eq. (3.6), where R is the order of the polynomial used. In this case the two-layer architecture is a higher order network with a general volume-conserving structure. The derivatives involved in the learning rule are given by eq. (3.7). 
\n\nljr \n\n(3.7) \n\n\fHigher Order Statistical Decorrelation without Information Loss \n\n253 \n\n4 RESULTS AND SIMULATIONS \nWe will present herein two different experiments using the architecture defined in this \npaper. The input space in all experiments is two-dimensional in order to show graphically \nthe results and effects of the presented model. The experiments aim at learning noisy non(cid:173)\nlinear polynomial and rational curves. Figure 2.a and 2.b plot the input and output space of \nthe second experiment after training is finished, respectively. In this case the noisy logistic \nmap was used to generate the input: \n\n(4.1) \n\nwhere'\\) introduces I % Gaussian noise. In this case a one-layer polynomial network with \nR - 2 was used. The learning constant was 11 - 0.01 and 20000 iterations of training \nwere performed. The result of Fig. 2.b is remarkable. The volume-conserving network \ndecorrelated the output space extracting the strong nonlinear correlation that generated the \ncurve in the input space. This means that after training only one coordinate is important to \ndescribe the curve. \n\n(a) \n\n(b) \n\n\u2022\u2022 \n\u2022\u2022 \n\u2022\u2022 \n\no \u2022 \n\nos \n\n04 \n\n02 \n\n' \u00b71'=\".2 ---.,....---:.:'-:-.---:.:':-.---,.,':-.-----::':: .. -----7----,,' .. \n\n.g7.-~-~.:'-:-2---,O~.--0~.-~ \u2022\u2022 ~~-~ \n\nFigure 2: Input and Output space distribution after training with a one-layer polynomial \nvolume-conseIVing network of order for the logistic map. (a) input space; (b) output space. \nThe whole information was compressed into the first coordinate of the output. This is the \ngeneralization of data compression normally performed by using linear peA (also called \nKarhunen-Loewe transformation). The next experiment is similar, but in this case a two(cid:173)\nlayer network of order R .. 4 was used. 
The input space is given by the rational function \n\nx_2 = 0.2 x_1 + x_1³ / (1 + x_1)² + υ   (4.2) \n\nwhere x_1 and υ are as in the last case. The results are shown in Fig. 4.a (input space) and Fig. 4.b (output space). Fig. 4.c shows the evolution of the four summands of eq. (2.18) during learning. It is important to remark that at the beginning the tensors of second and third order are equally important. During learning all summands are simultaneously minimized, resulting in a statistically decorrelated output. The training was performed during 20000 iterations and the learning constant was η = 0.005. \n\nFigure 4: Input and output space distribution after training with a two-layer polynomial volume-conserving network of order R = 4 for the noisy curve of eq. (4.2). (a) input space; (b) output space; (c) development of the four summands of the cost function (eq. 2.18) during learning: (cost2) first summand (second order correlation tensor); (cost3) second summand (third order correlation tensor); (cost4a) third summand (fourth order correlation tensor); (cost4b) fourth summand (fourth order correlation tensor). \n\n5 CONCLUSIONS \n\nWe proposed an unsupervised neural paradigm which is based on information theory. The algorithm performs redundancy reduction among the elements of the output layer without losing information as the data is sent through the network. 
The model developed performs a generalization of Barlow's unsupervised learning, which consists in nonlinear decorrelation up to higher orders of the cumulant tensors. After training, the components of the output layer are statistically independent. Due to the use of a higher order cumulant expansion, arbitrary non-Gaussian distributions can be rigorously handled. When nonlinear units are used, nonlinear principal component analysis is obtained; in this case nonlinear manifolds can be reduced to minimum-dimension manifolds. When linear units are used, the network performs a generalized principal component analysis, in the sense that non-Gaussian distributions can be linearly decorrelated. This paper generalizes previous works on factorial learning in two ways: the architecture performs a general nonlinear transformation without loss of information, and the decorrelation is performed without assuming Gaussian distributions. \n\nReferences: \n\nH. Barlow. (1989) Unsupervised Learning. Neural Computation, 1, 295-311. \nA. Papoulis. (1991) Probability, Random Variables, and Stochastic Processes. 3rd edition, McGraw-Hill, New York. \nA. N. Redlich. (1993) Supervised Factorial Learning. Neural Computation, 5, 750-766. \n", "award": [], "sourceid": 901, "authors": [{"given_name": "Gustavo", "family_name": "Deco", "institution": null}, {"given_name": "Wilfried", "family_name": "Brauer", "institution": null}]}