{"title": "A New Learning Algorithm for Blind Signal Separation", "book": "Advances in Neural Information Processing Systems", "page_first": 757, "page_last": 763, "abstract": null, "full_text": "A New Learning Algorithm for Blind \n\nSignal Separation \n\ns. Amari* \n\nUniversity of Tokyo \n\nBunkyo-ku, Tokyo 113, JAPAN \n\namari@sat.t. u-tokyo.ac.jp \n\nLab. for Artificial Brain Systems \n\nA. Cichocki \n\nFRP, RIKEN \n\nWako-Shi, Saitama, 351-01, JAPAN \n\ncia@kamo.riken.go.jp \n\nLab. for Information Representation \n\nH. H. Yang \n\nFRP, RIKEN \n\nWako-Shi, Saitama, 351-01, JAPAN \n\nhhy@koala.riken.go.jp \n\nAbstract \n\nA new on-line learning algorithm which minimizes a statistical de(cid:173)\npendency among outputs is derived for blind separation of mixed \nsignals. The dependency is measured by the average mutual in(cid:173)\nformation (MI) of the outputs. The source signals and the mixing \nmatrix are unknown except for the number of the sources. The \nGram-Charlier expansion instead of the Edgeworth expansion is \nused in evaluating the MI. The natural gradient approach is used \nto minimize the MI. A novel activation function is proposed for the \non-line learning algorithm which has an equivariant property and \nis easily implemented on a neural network like model. The validity \nof the new learning algorithm are verified by computer simulations. \n\n1 \n\nINTRODUCTION \n\nThe problem of blind signal separation arises in many areas such as speech recog(cid:173)\nnition, data communication, sensor signal processing, and medical science. Several \nneural network algorithms [3, 5, 7] have been proposed for solving this problem. \nThe performance of these algorithms is usually affected by the selection of the ac(cid:173)\ntivation functions for the formal neurons in the networks. However, all activation \n\n\u00b7Lab. for Information Representation, FRP, RIKEN, Wako-shi, Saitama, JAPAN \n\n\f758 \n\nS. AMARI, A. CICHOCKI, H. H. 
YANG \n\nfunctions attempted are monotonic and the selections of the activation functions \nare ad hoc. How should the activation function be determined to minimize the MI? \nIs it necessary to use monotonic activation functions for blind signal separation? In \nthis paper, we shall answer these questions and give an on-line learning algorithm \nwhich uses a non-monotonic activation function selected by the independent com(cid:173)\nponent analysis (ICA) [7]. Moreover, we shall show a rigorous way to derive the \nlearning algorithm which has the equivariant property, i.e., the performance of the \nalgorithm is independent of the scaling parameters in the noiseless case. \n\n2 PROBLEM \n\nLet us consider unknown source signals Si(t), i = 1\"\", n which are mutually in(cid:173)\ndependent. It is assumed that the sources Si(t) are stationary processes and each \nsource has moments of any order with a zero mean. The model for the sensor output \nis \n\nx(t) = As(t) \n\nis an unknown non-singular mixing matrix, set) \n\nwhere A E R nxn \n[Sl(t),\u00b7 .. , sn(t)]T and x(t) = [Xl(t), .. \u00b7, xn(t)JT. \nWithout knowing the source signals and the mixing matrix, we want to recover the \noriginal signals from the observations x(t) by the following linear transform: \n\nyet) = Wx(t) \n\nwhere yet) = [yl(t), ... , yn(t)]T and WE R nxn is a de-mixing matrix. \nIt is impossible to obtain the original sources Si(t) because they are not identifiable \nin the statistical sense. However, except for a permutation of indices, it is possible \nto obtain CiSi(t) where the constants Ci are indefinite nonzero scalar factors. The \nsource signals are identifiable in this sense. So our goal is to find the matrix W such \nthat [yl, ... , yn] coincides with a permutation of [Sl, ... ,sn] except for the scalar \nfactors. The solution W is the matrix which finds all independent components in \nthe outputs. An on-line learning algorithm for W is needed which performs the \nICA. 
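The scaling and permutation indeterminacy above can be checked numerically. The sketch below is not from the paper (the dimensions, the permutation, and the scale factors are arbitrary choices); it builds x = A s and verifies that any W of the form D P A^{-1}, with D diagonal and nonzero and P a permutation matrix, maps the observations back to scaled, permuted copies of the sources:

```python
import numpy as np

# Illustration of the identifiability discussion: any W = D P A^{-1}
# (D diagonal and nonzero, P a permutation) is a valid de-mixing matrix,
# recovering the sources up to the scalar factors c_i and a permutation.
rng = np.random.default_rng(0)
n, T = 3, 5
s = rng.uniform(-1, 1, size=(n, T))      # unknown sources s(t)
A = rng.uniform(-1, 1, size=(n, n))      # unknown non-singular mixing matrix
x = A @ s                                # sensor outputs x(t) = A s(t)

P = np.eye(n)[[2, 0, 1]]                 # a permutation matrix
D = np.diag([2.0, -0.5, 3.0])            # indefinite nonzero scale factors c_i
W = D @ P @ np.linalg.inv(A)             # one valid de-mixing matrix
y = W @ x
print(np.allclose(y, D @ P @ s))         # True: each y^i is c_i times a source
```

Any algorithm that only observes x(t) can at best single out one matrix from this family, which is why the goal is stated up to permutation and scale.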
It is possible to find such a learning algorithm by minimizing the dependency among the outputs. The algorithm in [6] is based on the Edgeworth expansion [8] for evaluating the marginal negentropy. Both the Gram-Charlier expansion [8] and the Edgeworth expansion [8] can be used to approximate probability density functions. We shall use the Gram-Charlier expansion instead of the Edgeworth expansion for evaluating the marginal entropy. We shall explain the reason in section 3.

3 INDEPENDENCE OF SIGNALS

The mathematical framework for the ICA is formulated in [6]. The basic idea of the ICA is to minimize the dependency among the output components. The dependency is measured by the Kullback-Leibler divergence between the joint distribution and the product of the marginal distributions of the outputs:

D(W) = \int p(y) \log \frac{p(y)}{\prod_{a=1}^{n} p_a(y^a)} \, dy    (1)

where p_a(y^a) is the marginal probability density function (pdf). Note that the Kullback-Leibler divergence has some invariant properties from the differential-geometrical point of view [1].

It is easy to relate the Kullback-Leibler divergence D(W) to the average MI of y:

D(W) = -H(y) + \sum_{a=1}^{n} H(y^a)    (2)

where H(y) = -\int p(y) \log p(y) \, dy and H(y^a) = -\int p_a(y^a) \log p_a(y^a) \, dy^a is the marginal entropy.

The minimization of the Kullback-Leibler divergence leads to an ICA algorithm for estimating W in [6], where the Edgeworth expansion is used to evaluate the negentropy. We use the truncated Gram-Charlier expansion to evaluate the Kullback-Leibler divergence. The Edgeworth expansion has advantages over the Gram-Charlier expansion only for some special distributions. In the case of the Gamma distribution, or the distribution of a random variable which is the sum of iid random variables, the coefficients of the Edgeworth expansion decrease uniformly.
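Decomposition (2) can be sanity-checked numerically in the Gaussian case, which is not treated in the paper but has closed-form entropies: for y ~ N(0, C), H(y) = (1/2) log((2πe)^n det C) and H(y^a) = (1/2) log(2πe C_aa), so D reduces to (1/2)(Σ_a log C_aa − log det C), which is nonnegative and vanishes exactly when C is diagonal:

```python
import numpy as np

# Mutual information D(W) of equation (2) for a zero-mean Gaussian vector
# with covariance C: D = 0.5 * (sum_a log C_aa - log det C) >= 0,
# with equality iff C is diagonal (independent components).
def gaussian_mi(C):
    C = np.asarray(C, dtype=float)
    return 0.5 * (np.sum(np.log(np.diag(C))) - np.linalg.slogdet(C)[1])

C_dep = np.array([[1.0, 0.8],
                  [0.8, 1.0]])            # correlated components
print(gaussian_mi(C_dep))                 # ≈ 0.511, strictly positive
print(gaussian_mi(np.eye(2)))             # 0.0: independent components
```

For non-Gaussian outputs no such closed form exists, which is exactly why the expansions discussed here are needed.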
However, there is no such advantage for the mixed outputs y^a in general cases.

To calculate each H(y^a) in (2), we shall apply the Gram-Charlier expansion to approximate the pdf p_a(y^a). Since E[y] = E[WAs] = 0, we have E[y^a] = 0. To simplify the calculations for the entropy H(y^a) to be carried out later, we assume m_2^a = 1. We use the following truncated Gram-Charlier expansion to approximate the pdf p_a(y^a):

p_a(y^a) \approx \alpha(y^a) \left\{ 1 + \frac{\kappa_3^a}{3!} H_3(y^a) + \frac{\kappa_4^a}{4!} H_4(y^a) \right\}    (3)

where \kappa_3^a = m_3^a and \kappa_4^a = m_4^a - 3, m_k^a = E[(y^a)^k] is the k-th order moment of y^a, \alpha(y) = \frac{1}{\sqrt{2\pi}} e^{-y^2/2}, and the H_k(y) are the Chebyshev-Hermite polynomials defined by the identity

(-1)^k \frac{d^k \alpha(y)}{dy^k} = H_k(y) \alpha(y).

We prefer the Gram-Charlier expansion to the Edgeworth expansion because the former clearly shows how \kappa_3^a and \kappa_4^a affect the approximation of the pdf. The last two terms in (3) characterize non-Gaussian distributions. To apply (3) to calculate H(y^a), we need the following integrals:

-\int \alpha(y) H_2(y) \log \alpha(y) \, dy = 1    (4)

\int \alpha(y) (H_2(y))^2 H_4(y) \, dy = 24.    (5)

These integrals can be obtained easily from the following results for the moments of a Gaussian random variable N(0,1):

\int y^{2k+1} \alpha(y) \, dy = 0, \quad \int y^{2k} \alpha(y) \, dy = 1 \cdot 3 \cdots (2k-1).    (6)

By using the expansion

\log(1+y) \approx y - \frac{y^2}{2} + O(y^3)

and taking account of the orthogonality relations of the Chebyshev-Hermite polynomials and (4)-(5), the entropy H(y^a) is expanded as

H(y^a) \approx \frac{1}{2} \log(2\pi e) - \frac{(\kappa_3^a)^2}{2 \cdot 3!} - \frac{(\kappa_4^a)^2}{2 \cdot 4!} + \frac{5}{8} (\kappa_3^a)^2 \kappa_4^a + \frac{1}{16} (\kappa_4^a)^3.    (7)

It is easy to calculate -\int \alpha(y) \log \alpha(y) \, dy = \frac{1}{2} \log(2\pi e).

From y = Wx, we have H(y) = H(x) + \log|\det(W)|. Applying (7) and the above expressions to (2), we have

D(W) \approx -H(x) - \log|\det(W)| + \frac{n}{2} \log(2\pi e) - \sum_{a=1}^{n} \left[ \frac{(\kappa_3^a)^2}{2 \cdot 3!} + \frac{(\kappa_4^a)^2}{2 \cdot 4!} - \frac{5}{8}(\kappa_3^a)^2 \kappa_4^a - \frac{1}{16}(\kappa_4^a)^3 \right].

4 A NEW LEARNING ALGORITHM

To obtain the gradient descent algorithm to update W recursively, we need to calculate \partial D / \partial w_k^a, where w_k^a is the element of W in the a-th row and k-th column. Let cof(w_k^a) be the cofactor of w_k^a in W. It is not difficult to derive the following:

\frac{\partial \log|\det(W)|}{\partial w_k^a} = \frac{cof(w_k^a)}{\det(W)} = (W^{-T})_k^a, \quad \frac{\partial \kappa_3^a}{\partial w_k^a} = 3E[(y^a)^2 x_k], \quad \frac{\partial \kappa_4^a}{\partial w_k^a} = 4E[(y^a)^3 x_k]    (8)

where (W^{-T})_k^a denotes the (a,k) element of (W^T)^{-1}. From (8), we obtain

\frac{\partial D}{\partial w_k^a} \approx -(W^{-T})_k^a + f(\kappa_3^a, \kappa_4^a) E[(y^a)^2 x_k] + g(\kappa_3^a, \kappa_4^a) E[(y^a)^3 x_k]    (9)

where

f(y, z) = -\frac{1}{2} y + \frac{15}{4} yz, \quad g(y, z) = -\frac{1}{6} z + \frac{5}{2} y^2 + \frac{3}{4} z^2.

From (9), we obtain the gradient descent algorithm to update W recursively:

\frac{dw_k^a}{dt} = -\eta(t) \frac{\partial D}{\partial w_k^a} = \eta(t) \left\{ (W^{-T})_k^a - f(\kappa_3^a, \kappa_4^a) E[(y^a)^2 x_k] - g(\kappa_3^a, \kappa_4^a) E[(y^a)^3 x_k] \right\}    (10)

where \eta(t) is a learning rate function. Replacing the expectation values in (10) by their instantaneous values, we have the stochastic gradient descent algorithm:

\frac{dw_k^a}{dt} = \eta(t) \left\{ (W^{-T})_k^a - f(\kappa_3^a, \kappa_4^a)(y^a)^2 x_k - g(\kappa_3^a, \kappa_4^a)(y^a)^3 x_k \right\}.    (11)

We need to use the following adaptive algorithm to compute \kappa_3^a and \kappa_4^a in (11):

\frac{d\kappa_3^a}{dt} = -\mu(t)(\kappa_3^a - (y^a)^3), \quad \frac{d\kappa_4^a}{dt} = -\mu(t)(\kappa_4^a - (y^a)^4 + 3)    (12)

where \mu(t) is another learning rate function.

The performance of the algorithm (11) relies on the estimation of the third- and fourth-order cumulants performed by the algorithm (12). Replacing the moments of the random variables in (11) by their instantaneous values, we obtain the following algorithm, which is a direct but coarse implementation of (11):

\frac{dw_k^a}{dt} = \eta(t) \left\{ (W^{-T})_k^a - f(y^a) x_k \right\}    (13)

where the activation function f(y) is defined by

f(y) = \frac{3}{4} y^{11} + \frac{25}{4} y^9 - \frac{14}{3} y^7 - \frac{47}{4} y^5 + \frac{29}{4} y^3.    (14)

Note that the activation function f(y) is an odd function, not a monotonic function. The equation (13) can be written in the matrix form

\frac{dW}{dt} = \eta(t) \left\{ W^{-T} - f(y) x^T \right\}    (15)

which can be further simplified, by substituting x^T W^T = y^T, to

\frac{dW}{dt} = \eta(t) \left\{ I - f(y) y^T \right\} W^{-T}    (16)

where f(y) = (f(y^1), ..., f(y^n))^T. The above equation is based on the gradient descent algorithm (10) with the following matrix form:

\frac{dW}{dt} = -\eta(t) \frac{\partial D}{\partial W}.    (17)

From the information geometry perspective [1], since the mixing matrix A is non-singular we had better replace the above algorithm by the following natural gradient descent algorithm:

\frac{dW}{dt} = -\eta(t) \frac{\partial D}{\partial W} W^T W.    (18)

Applying the previous approximation of the gradient \partial D / \partial W to (18), we obtain the following algorithm:

\frac{dW}{dt} = \eta(t) \left\{ I - f(y) y^T \right\} W    (19)

which has the same \"equivariant\" property as the algorithms developed in [4, 5].

Although the on-line learning algorithms (16) and (19) look similar to those in [3, 7] and [5] respectively, the selection of the activation function in this paper is rational, not ad hoc. The activation function (14) is determined by the ICA. It is a non-monotonic activation function different from those used in [3, 5, 7].

There is a simple way to justify the stability of the algorithm (19). Let Vec(·) denote an operator on a matrix which cascades the columns of the matrix from left to right and forms a column vector. Note that this operator has the following property:

Vec(ABC) = (C^T \otimes A) Vec(B).    (20)

Both the gradient descent algorithm and the natural gradient descent algorithm are special cases of the following general gradient descent algorithm:

\frac{dVec(W)}{dt} = -\eta(t) P \frac{\partial D}{\partial Vec(W)}    (21)

where P is a symmetric and positive definite matrix. It is trivial that (21) becomes (17) when P = I.
When P = W^T W \otimes I, applying (20) to (21) we obtain

\frac{dVec(W)}{dt} = -\eta(t)(W^T W \otimes I) \frac{\partial D}{\partial Vec(W)} = -\eta(t) Vec\left( \frac{\partial D}{\partial W} W^T W \right)

and this equation implies (18). So the natural gradient descent algorithm updates W(t) in the direction of decreasing the dependency D(W). The information geometry theory [1] explains why the natural gradient descent algorithm should be used to minimize the MI.

Another on-line learning algorithm for blind separation, using a recurrent network, was proposed in [2]. For this algorithm, the activation function (14) also works well. In practice, other activation functions such as those proposed in [2]-[6] may also be used in (19). However, the performance of the algorithm for such functions usually depends on the distributions of the sources. The activation function (14) works in relatively general cases in which the pdf of each source can be approximated by the truncated Gram-Charlier expansion.

5 SIMULATION

In order to check the validity and performance of the new on-line learning algorithm (19), we simulate it on the computer using synthetic source signals and a random mixing matrix. The extensive computer simulations have fully confirmed the theory and the validity of the algorithm (19). Due to the limit of space we present here only one illustrative example.

Example: Assume that the following three unknown sources are mixed by a random mixing matrix A:

[s_1(t), s_2(t), s_3(t)] = [n(t), 0.1 sin(400t) cos(30t), 0.01 sign(sin(500t + 9 cos(40t)))]

where n(t) is a noise source uniformly distributed in the range [-1, +1], and s_2(t) and s_3(t) are two deterministic source signals. The elements of the mixing matrix A are randomly chosen in [-1, +1]. The learning rate decreases exponentially to zero as \eta(t) = 250 \exp(-5t).

A simulation result is shown in Figure 1.
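A discrete-time sketch of the natural gradient rule (19) with the activation (14) is given below. The time step, the learning-rate schedule, the unit-scale source amplitudes, and the clipping of f(y) are numerical choices made only for this sketch; the paper's simulation uses η(t) = 250 exp(−5t) and the source amplitudes stated above.

```python
import numpy as np

# Euler discretization of dW/dt = eta(t) (I - f(y) y^T) W  (equation (19))
# with the activation (14). The clip is a numerical safeguard for this
# sketch; the paper's analysis uses the raw polynomial.
def f(y):
    raw = (3/4)*y**11 + (25/4)*y**9 - (14/3)*y**7 - (47/4)*y**5 + (29/4)*y**3
    return np.clip(raw, -5.0, 5.0)

rng = np.random.default_rng(0)
n, T, dt = 3, 20000, 1e-3
t = np.arange(T) * dt
s = np.vstack([rng.uniform(-1, 1, T),                     # uniform noise n(t)
               np.sin(400*t) * np.cos(30*t),              # deterministic sources,
               np.sign(np.sin(500*t + 9*np.cos(40*t)))])  # rescaled to unit amplitude
A = rng.uniform(-1, 1, (n, n))                            # random mixing matrix
x = A @ s

W = 0.3 * np.eye(n)                                       # initial de-mixing matrix
for k in range(T):
    eta = 0.01 * np.exp(-0.2 * k * dt)                    # decaying learning rate
    y = W @ x[:, k]
    W += eta * (np.eye(n) - np.outer(f(y), y)) @ W

print(W @ A)  # approaches a scaled permutation matrix when separation succeeds
```

The equivariant form of (19) is visible in the update: it depends on W only through the outputs y, so rescaling the mixing matrix does not change the trajectory of W A.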
The first three signals, denoted by X1, X2 and X3, represent the mixed (sensor) signals x_1(t), x_2(t) and x_3(t). The last three signals, denoted by O1, O2 and O3, represent the output signals y^1(t), y^2(t) and y^3(t). By using the proposed learning algorithm, the neural network is able to extract the deterministic signals from the observations after approximately 500 milliseconds.

The performance index E_1 is defined by

E_1 = \sum_{i=1}^{n} \left( \sum_{j=1}^{n} \frac{|p_{ij}|}{\max_k |p_{ik}|} - 1 \right) + \sum_{j=1}^{n} \left( \sum_{i=1}^{n} \frac{|p_{ij}|}{\max_k |p_{kj}|} - 1 \right)

where P = (p_{ij}) = WA.

6 CONCLUSION

The major contribution of this paper is the rigorous derivation of an effective blind separation algorithm with the equivariant property, based on the minimization of the MI of the outputs. The ICA is a general principle to design algorithms for blind signal separation. The main difficulties in applying this principle are to evaluate the MI of the outputs and to find a working algorithm which decreases the MI. Different from the work in [6], we use the Gram-Charlier expansion instead of the Edgeworth expansion to calculate the marginal entropy in evaluating the MI. Using the natural gradient method to minimize the MI, we have found an on-line learning algorithm to find the de-mixing matrix. The algorithm has the equivariant property and can be easily implemented on a neural-network-like model. Our approach provides a rational selection of the activation function for the formal neurons in the network. The algorithm has been simulated for separating unknown source signals mixed by a random mixing matrix. Our theory and the validity of the new learning algorithm are verified by the simulations.

Figure 1: The mixed and separated signals, and the performance index

Acknowledgment

We would like to thank Dr. Xiao Yan SU for the proof-reading of the manuscript.

References

[1] S.-I. Amari.
Differential-Geometrical Methods in Statistics, Lecture Notes in Statistics vol. 28. Springer, 1985.

[2] S. Amari, A. Cichocki, and H. H. Yang. Recurrent neural networks for blind separation of sources. In Proceedings 1995 International Symposium on Nonlinear Theory and Applications, volume I, pages 37-42, December 1995.

[3] A. J. Bell and T. J. Sejnowski. An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159, 1995.

[4] J.-F. Cardoso and B. Laheld. Equivariant adaptive source separation. To appear in IEEE Trans. on Signal Processing, 1996.

[5] A. Cichocki, R. Unbehauen, L. Moszczyński, and E. Rummert. A new on-line adaptive learning algorithm for blind separation of source signals. In ISANN94, pages 406-411, Taiwan, December 1994.

[6] P. Comon. Independent component analysis, a new concept? Signal Processing, 36:287-314, 1994.

[7] C. Jutten and J. Herault. Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24:1-10, 1991.

[8] A. Stuart and J. K. Ord. Kendall's Advanced Theory of Statistics. Edward Arnold, 1994.
", "award": [], "sourceid": 1115, "authors": [{"given_name": "Shun-ichi", "family_name": "Amari", "institution": null}, {"given_name": "Andrzej", "family_name": "Cichocki", "institution": null}, {"given_name": "Howard", "family_name": "Yang", "institution": null}]}