{"title": "One-unit Learning Rules for Independent Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 480, "page_last": 486, "abstract": null, "full_text": "One-unit Learning Rules for \n\nIndependent Component Analysis \n\nAapo Hyvarinen and Erkki Oja \nHelsinki University of Technology \n\nLaboratory of Computer and Information Science \nRakentajanaukio 2 C, FIN-02150 Espoo, Finland \nemail: {Aapo.Hyvarinen.Erkki.Oja}(Qhut.fi \n\nAbstract \n\nNeural one-unit learning rules for the problem of Independent Com(cid:173)\nponent Analysis (ICA) and blind source separation are introduced. \nIn these new algorithms, every ICA neuron develops into a sepa(cid:173)\nrator that finds one of the independent components. The learning \nrules use very simple constrained Hebbianjanti-Hebbian learning \nin which decorrelating feedback may be added. To speed up the \nconvergence of these stochastic gradient descent rules, a novel com(cid:173)\nputationally efficient fixed-point algorithm is introduced. \n\n1 \n\nIntroduction \n\nIndependent Component Analysis (ICA) (Comon, 1994; Jutten and Herault, 1991) \nis a signal processing technique whose goal is to express a set of random vari(cid:173)\nables as linear combinations of statistically independent component variables. The \nmain applications of ICA are in blind source separation, feature extraction, and \nblind deconvolution. In the simplest form of ICA (Comon, 1994), we observe m \nscalar random variables Xl, ... , Xm which are assumed to be linear combinations of \nn unknown components 81, ... 8 n that are zero-mean and mutually statistically inde-\npendent. In addition, we must assume n ~ m. If we arrange the observed variables \nXi into a vector x = (Xl,X2, ... ,xm)T and the component variables 8j into a vector \ns, the linear relationship can be expressed as \n\nHere, A is an unknown m x n matrix of full rank, called the mixing matrix. 
x = As (1) \n\nNoise may also be added to the model, but it is omitted here for simplicity. The basic problem of ICA is then to estimate (separate) the realizations of the original independent components sj, or a subset of them, using only the mixtures xi. This is roughly equivalent to estimating the rows, or a subset of the rows, of the pseudoinverse of the mixing matrix A. The fundamental restriction of the model is that we can only estimate non-Gaussian independent components, or ICs (except if just one of the ICs is Gaussian). Moreover, the ICs and the columns of A can only be estimated up to a multiplicative constant, because any constant multiplying an IC in eq. (1) could be cancelled by dividing the corresponding column of the mixing matrix A by the same constant. For mathematical convenience, we define here that the ICs sj have unit variance. This makes the (non-Gaussian) ICs unique, up to their signs. Note that the assumption of zero mean of the ICs is in fact no restriction, as this can always be accomplished by subtracting the mean from the random vector x. Note also that no order is defined between the ICs. \n\nIn blind source separation (Jutten and Herault, 1991), the observed values of x correspond to a realization of an m-dimensional discrete-time signal x(t), t = 1, 2, .... Then the components sj(t) are called source signals. The source signals are usually original, uncorrupted signals or noise sources. Another application of ICA is feature extraction (Bell and Sejnowski, 1996; Hurri et al., 1996), where the columns of the mixing matrix A define features, and the sj signal the presence and the amplitude of a feature. A closely related problem is blind deconvolution, in which a convolved version x(t) of a scalar i.i.d. signal s(t) is observed. 
The goal is then to recover the original signal s(t) without knowing the convolution kernel (Donoho, 1981). This problem can be represented in a way similar to eq. (1), replacing the matrix A by a filter. \n\nThe current neural algorithms for Independent Component Analysis, e.g. (Bell and Sejnowski, 1995; Cardoso and Laheld, 1996; Jutten and Herault, 1991; Karhunen et al., 1997; Oja, 1995), try to estimate all the components simultaneously. This is often neither necessary nor feasible, and it is often desired to estimate only a subset of the ICs. This is the starting point of our paper. We introduce learning rules for a single neuron, by which the neuron learns to estimate one of the ICs. A network of several such neurons can then estimate several (1 to n) ICs. Both learning rules for the 'raw' data (Section 3) and for whitened data (Section 4) are introduced. If the data is whitened, the convergence is sped up, and some interesting simplifications and approximations are made possible. Feedback mechanisms (Section 5) are also mentioned. Finally, we introduce a novel approach for performing the computations needed in the ICA learning rules, which uses a very simple, yet highly efficient, fixed-point iteration scheme (Section 6). An important generalization of our learning rules is discussed in Section 7, and an illustrative experiment is shown in Section 8. \n\n2 Using Kurtosis for ICA Estimation \n\nWe begin by introducing the basic mathematical framework of ICA. Most suggested solutions for ICA use the fourth-order cumulant or kurtosis of the signals, defined for a zero-mean random variable v as kurt(v) = E{v^4} - 3 (E{v^2})^2. For a Gaussian random variable, kurtosis is zero. Therefore, random variables of positive kurtosis are sometimes called super-Gaussian, and variables of negative kurtosis sub-Gaussian. 
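The kurtosis defined above is straightforward to estimate from samples. A minimal numpy sketch (the random generators used here are illustrative):

```python
import numpy as np

def kurt(v):
    """Sample version of kurt(v) = E{v^4} - 3 (E{v^2})^2 for zero-mean v."""
    v = np.asarray(v, dtype=float)
    return np.mean(v ** 4) - 3.0 * np.mean(v ** 2) ** 2

rng = np.random.default_rng(0)
print(kurt(rng.normal(size=100_000)))          # close to 0: Gaussian
print(kurt(rng.laplace(size=100_000)))         # positive: super-Gaussian
print(kurt(rng.uniform(-1, 1, size=100_000)))  # negative: sub-Gaussian
```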
Note that for two independent random variables v1 and v2 and for a scalar α, it holds that kurt(v1 + v2) = kurt(v1) + kurt(v2) and kurt(α v1) = α^4 kurt(v1). \n\nLet us search for a linear combination of the observations xi, say w^T x, such that it has maximal or minimal kurtosis. Obviously, this is meaningful only if w is somehow bounded; let us assume that the variance of the linear combination is constant: E{(w^T x)^2} = 1. Using the mixing matrix A in eq. (1), let us define z = A^T w. Then also ||z||^2 = w^T A A^T w = w^T E{x x^T} w = E{(w^T x)^2} = 1. Using eq. (1) and the properties of the kurtosis, we have \n\nkurt(w^T x) = kurt(w^T As) = kurt(z^T s) = Σ_{j=1}^{n} zj^4 kurt(sj) (2) \n\nUnder the constraint E{(w^T x)^2} = ||z||^2 = 1, the function in (2) has a number of local minima and maxima. To make the argument clearer, let us assume for the moment that the mixture contains at least one IC whose kurtosis is negative, and at least one whose kurtosis is positive. Then, as may be obvious, and was rigorously proven by Delfosse and Loubaton (1995), the extremal points of (2) are obtained when all the components zj of z are zero except one component which equals ±1. In particular, the function in (2) is maximized (resp. minimized) exactly when the linear combination w^T x = z^T s equals, up to the sign, one of the ICs sj of positive (resp. negative) kurtosis. Thus, finding the extrema of the kurtosis of w^T x enables estimation of the independent components. Equation (2) also shows that Gaussian components, or other components whose kurtosis is zero, cannot be estimated by this method. \nTo actually minimize or maximize kurt(w^T x), a neural algorithm based on gradient descent or ascent can be used. Then w is interpreted as the weight vector of a neuron with input vector x and linear output w^T x. 
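Relation (2) can be checked numerically. In the sketch below the sources, the mixing matrix, and the random choice of w are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

# Unit-variance independent sources: uniform (kurtosis < 0), Laplacian (kurtosis > 0).
s = np.vstack([rng.uniform(-np.sqrt(3), np.sqrt(3), n),
               rng.laplace(0.0, 1.0 / np.sqrt(2), n)])
A = np.array([[1.5, 0.5],
              [0.5, 1.0]])
x = A @ s

def kurt(v):
    return np.mean(v ** 4) - 3.0 * np.mean(v ** 2) ** 2

# Pick a random w and rescale it so that E{(w^T x)^2} = 1 (sample estimate).
w = rng.normal(size=2)
w /= np.sqrt(w @ np.cov(x) @ w)

z = A.T @ w
lhs = kurt(w @ x)                                    # kurt(w^T x)
rhs = sum(z[j] ** 4 * kurt(s[j]) for j in range(2))  # sum_j zj^4 kurt(sj)
print(lhs, rhs)  # approximately equal
```

The two values agree up to sampling error, as predicted by (2).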
The objective function can be simplified because of the constraint E{(w^T x)^2} = 1: it holds that kurt(w^T x) = E{(w^T x)^4} - 3. The constraint E{(w^T x)^2} = 1 itself can be taken into account by a penalty term. The final objective function is then of the form \n\nJ(w) = α E{(w^T x)^4} + β F(E{(w^T x)^2}) (3) \n\nwhere α, β > 0 are arbitrary scaling constants, and F is a suitable penalty function. Our basic ICA learning rules are stochastic gradient descents or ascents for an objective function of this form. In the next two sections, we present learning rules resulting from adequate choices of the penalty function F. Preprocessing of the data (whitening) is also used to simplify J in Section 4. An alternative method for finding the extrema of kurtosis is the fixed-point algorithm; see Section 6. \n\n3 Basic One-Unit ICA Learning Rules \n\nIn this section, we introduce learning rules for a single neural unit. These basic learning rules require no preprocessing of the data, except that the data must be made zero-mean. Our learning rules are divided into two categories. As explained in Section 2, the learning rules either minimize the kurtosis of the output to separate ICs of negative kurtosis, or maximize it for components of positive kurtosis. \n\nLet us assume that we observe a sample sequence x(t) of a vector x that is a linear combination of independent components s1, ..., sn according to eq. (1). For separating one of the ICs of negative kurtosis, we use the following learning rule for the weight vector w of a neuron: \n\nΔw(t) = μ(t) x(t) g-(w(t)^T x(t)), with g-(u) = a u - b u^3 (4) \n\nwhere μ(t) is the learning rate sequence and a, b > 0. This learning rule is clearly a stochastic gradient descent for a function of the form (3), with F(u) = -u. To separate an IC of positive kurtosis, we use the following learning rule: \n\nΔw(t) = μ(t) x(t) g+(w(t)^T x(t)), with g+(u) = b u^3 - a (w(t)^T C w(t)) u (5) \n\nwhere C = E{x x^T} and a, b > 0 are arbitrary constants. This learning rule is a stochastic gradient ascent for a function of the form (3), with F(u) = -u^2. 
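A toy simulation of a one-unit rule of this family is sketched below; the nonlinearity g(u) = u - u^3 (i.e. a = b = 1), the constant learning rate, and the data are our own illustrative assumptions, not the paper's exact algorithm. The output should converge, up to sign and scale, to the negative-kurtosis source:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Unit-variance sources: uniform (negative kurtosis), Laplacian (positive kurtosis).
s = np.vstack([rng.uniform(-np.sqrt(3), np.sqrt(3), n),
               rng.laplace(0.0, 1.0 / np.sqrt(2), n)])
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
x = A @ s

# Stochastic rule of the form dw = mu * x * (a*u - b*u^3), aiming at the
# sub-Gaussian IC: the linear term is Hebbian, the cubic term anti-Hebbian.
w = 0.1 * rng.normal(size=2)
mu = 0.005
for _ in range(4):                 # a few passes over the sample
    for t in range(n):
        u = w @ x[:, t]
        w += mu * x[:, t] * (u - u ** 3)

y = w @ x
r = np.corrcoef(y, s[0])[0, 1]     # output vs. the sub-Gaussian source
print(abs(r))                      # close to 1 after convergence
```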
Note that w(t)^T C w(t) in g+ might also be replaced by (E{(w(t)^T x(t))^2})^2 or by ||w(t)||^4 to enable a simpler implementation. \n\nIt can be proven (Hyvarinen and Oja, 1996b) that using the learning rules (4) and (5), the linear output converges to c sj(t), where sj(t) is one of the ICs and c is a scalar constant. This multiplication of the source signal by the constant c is in fact not a restriction, as the variance and the sign of the sources cannot be estimated. The only condition for convergence is that one of the ICs must be of negative (resp. positive) kurtosis when learning rule (4) (resp. learning rule (5)) is used. Thus we can say that the neuron learns to separate (estimate) one of the independent components. It is also possible to combine these two learning rules into a single rule that separates an IC of any kurtosis; see (Hyvarinen and Oja, 1996b). \n\n4 One-Unit ICA Learning Rules for Whitened Data \n\nWhitening, also called sphering, is a very useful preprocessing technique. It speeds up the convergence considerably, makes the learning numerically more stable, and allows some interesting modifications of the learning rules. Whitening means that the observed vector x is linearly transformed to a vector v = Ux such that its elements vi are mutually uncorrelated and all have unit variance (Comon, 1994). Thus the correlation matrix of v equals unity: E{v v^T} = I. This transformation is always possible and can be accomplished by classical Principal Component Analysis. At the same time, the dimensionality of the data should be reduced so that the dimension of the transformed data vector v equals n, the number of independent components. This also has the effect of reducing noise. \n\nLet us thus suppose that the observed signal v(t) is whitened (sphered). 
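Whitening by PCA can be sketched as follows; the data are illustrative, and U is built from the eigendecomposition of the sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
s = np.vstack([rng.uniform(-np.sqrt(3), np.sqrt(3), n),
               rng.laplace(0.0, 1.0 / np.sqrt(2), n)])
x = np.array([[1.0, 0.8],
              [0.3, 1.0]]) @ s

# Eigendecomposition of the covariance matrix: E{x x^T} = E D E^T.
C = np.cov(x)
d, Evec = np.linalg.eigh(C)

# Whitening matrix U = D^{-1/2} E^T, so that v = Ux has E{v v^T} = I.
U = np.diag(d ** -0.5) @ Evec.T
v = U @ x
print(np.cov(v))  # close to the identity matrix
```

In the general m > n case, dimensionality reduction amounts to keeping only the n largest eigenvalues in D and the corresponding columns of E.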
Then, in order to separate one of the components of negative kurtosis, we can modify the learning rule (4) so as to get the following learning rule for the weight vector w: \n\nΔw(t) = μ(t) [v(t) g-(w(t)^T v(t)) - w(t)], with g-(u) = a u - b u^3 (6) \n\nwhere now a > 1 and b > 0. This modification is valid because we now have E{v (w^T v)} = w, and thus we can add +w(t) in the linear part of g- and subtract w(t) explicitly afterwards. The modification is useful because it allows us to approximate g- with the 'tanh' function, as w(t)^T v(t) then stays in the range where this approximation is valid. Thus we get what is perhaps the simplest possible stable Hebbian learning rule for a nonlinear Perceptron. \n\nTo separate one of the components of positive kurtosis, rule (5) simplifies to: \n\nΔw(t)