{"title": "Maximum Likelihood Blind Source Separation: A Context-Sensitive Generalization of ICA", "book": "Advances in Neural Information Processing Systems", "page_first": 613, "page_last": 619, "abstract": null, "full_text": "Maximum Likelihood Blind Source \n\nSeparation: A Context-Sensitive \n\nGeneralization of ICA \n\nBarak A. Pearlmutter \n\nLucas C. Parra \n\nComputer Science Dept, FEC 313 \n\nSiemens Corporate Research \n\nUniversity of New Mexico \nAlbuquerque, NM 87131 \n\nbap@cs.unm.edu \n\n755 College Road East \n\nPrinceton, N J 08540-6632 \n\nlucas@scr.siemens.com \n\nAbstract \n\nIn the square linear blind source separation problem, one must find \na linear unmixing operator which can detangle the result Xi(t) of \nmixing n unknown independent sources 8i(t) through an unknown \nn x n mixing matrix A( t) of causal linear filters: Xi = E j aij * 8 j . \nWe cast the problem as one of maximum likelihood density estima(cid:173)\ntion, and in that framework introduce an algorithm that searches \nfor independent components using both temporal and spatial cues. \nWe call the resulting algorithm \"Contextual ICA,\" after the (Bell \nand Sejnowski 1995) Infomax algorithm, which we show to be a \nspecial case of cICA. Because cICA can make use of the temporal \nstructure of its input, it is able separate in a number of situations \nwhere standard methods cannot, including sources with low kur(cid:173)\ntosis, colored Gaussian sources, and sources which have Gaussian \nhistograms. \n\n1 The Blind Source Separation Problem \n\nConsider a set of n indepent sources 81 (t), . .. ,8n (t). We are given n linearly dis(cid:173)\ntorted sensor reading which combine these sources, Xi = E j aij8j, where aij is a \nfilter between source j and sensor i, as shown in figure 1a. This can be expressed \nas \n\nXi(t) = 2: 2: aji(r)8j(t - r) = 2: aji * 8j \n\n00 \n\nj \n\nr=O \n\nj \n\n\f614 \n\nB. A. Pearlmutter and L. C. Parra \n\nIftY/(t )IY/(t-l) \u2022 ... 
;,,(1\u00bb f-h-..... ~,,---------..,------~ \nf, \n~Y~~-+~-+-~ \n\n11\"-\n\nx. \n\nFigure 1: The left diagram shows a generative model of data production for blind \nsource separation problem. The cICA algorithm fits the reparametrized generative \nmodel on the right to the data. Since (unless the mixing process is singular) both \ndiagrams give linear maps between the sources and the sensors, they are mathe(cid:173)\nmaticallyequivalent. However, (a) makes the transformation from s to x explicit, \nwhile (b) makes the transformation from x to y, the estimated sources, explicit. \n\nor, in matrix notation, x(t) = L~=o A(T)S(t - T) = A * s. The square linear blind \nsource separation problem is to recover S from x. There is an inherent ambiguity \nin this, for if we define a new set of sources s' by s~ = bi * Si where bi ( T) is some \ninvertable filter, then the various s~ are independent, and constitute just as good a \nsolution to the problem as the true Si, since Xi = Lj(aij * bjl) * sj. Similarly the \nsources could be arbitrarily permuted. \n\nSurprisingly, up to permutation of the sources and linear filtering of the individual \nsources, the problem is well posed-assuming that the sources Sj are not Gaussian. \nThe reason for this is that only with a correct separation are the recovered sources \ntruly statistically independent, and this fact serves as a sufficient constraint. Under \nthe assumptions we have made, I and further assuming that the linear transforma(cid:173)\ntion A is invertible, we will speak of recovering Yi(t) = Lj Wji * Xj where these \nYi are a filtered and permuted version of the original unknown Si. For clarity of \nexposition, will often refer to \"the\" solution and refer to the Yi as \"the\" recovered \nsources, rather than refering to an point in the manifold of solutions and a set of \nconsistent recovered sources. 
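The convolutive mixture above can be made concrete in a few lines of code. The following sketch (our own illustration, not the authors' code; the filter taps, source distributions, and function name are invented for this example) generates two independent sources, mixes them with a 2 x 2 matrix of short causal FIR filters, and checks the t = 0 sample against the tap-0 mixing matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
# Two independent, non-Gaussian (uniform) sources s_j(t).
s = rng.uniform(-1.0, 1.0, size=(2, T))

# a_ij(tau): a 2 x 2 matrix of causal two-tap FIR mixing filters (arbitrary values).
A = np.array([[[1.0, 0.5], [0.3, 0.2]],
              [[0.4, 0.1], [1.0, -0.6]]])  # shape (n, n, taps)

def convolutive_mix(A, s):
    """x_i(t) = sum_j sum_tau a_ij(tau) s_j(t - tau), truncated to the sample length."""
    n, _, _ = A.shape
    T = s.shape[1]
    x = np.zeros((n, T))
    for i in range(n):
        for j in range(n):
            x[i] += np.convolve(s[j], A[i, j])[:T]
    return x

x = convolutive_mix(A, s)
# At t = 0 only the tau = 0 taps contribute, so x(0) = A(0) s(0).
assert np.allclose(x[:, 0], A[:, :, 0] @ s[:, 0])
```

Filtering each source s_j through an invertible filter b_j and absorbing b_j^{-1} into the mixing filters would leave x unchanged, which is exactly the filtering ambiguity discussed above.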
\n\n2 Maximum likelihood density estimation \n\nFollowing Pham, Garrat, and Jutten (1992) and Belouchrani and Cardoso (1995), we cast the BSS problem as one of maximum likelihood density estimation. In the MLE framework, one begins with a probabilistic model of the data production process. This probabilistic model is parametrized by a vector of modifiable parameters w, and it therefore assigns a w-dependent probability density p(x_0, x_1, ...; w) to each possible dataset x_0, x_1, .... The task is then to find a w which maximizes this probability. \n\nThere are a number of approaches to performing this maximization. Here we apply the stochastic gradient method, in which a single stochastic sample x is chosen from the dataset and −d log p(x; w)/dw is used as a stochastic estimate of the gradient of the negative likelihood Σ_t −d log p(x(t); w)/dw. \n\n[1] Without these assumptions, for instance in the presence of noise, even a linear mixing process leads to an optimal unmixing process that is highly nonlinear. \n\n2.1 The likelihood of the data \n\nThe model of data production we consider is shown in figure 1a. In that model, the sensor readings x are an explicit linear function of the underlying sources s. \n\nIn this model of the data production, there are two stages. In the first stage, the sources independently produce signals. These signals are time-dependent, and the probability density of source j producing value s_j(t) at time t is f_j(s_j(t)|s_j(t − 1), s_j(t − 2), ...). Although this source model could be of almost any differentiable form, we used a generalized autoregressive model described in appendix A. For expository purposes, we can consider using a simple AR model, in which we model s_j(t) = b_j(1) s_j(t − 1) + b_j(2) s_j(t − 2) + ... + b_j(T) s_j(t − T) + r_j, where r_j is an iid random variable, perhaps with a complicated density. 
\n\nIt is important to distinguish two different, although related, linear filters. When the source models are simple AR models, there are two types of linear convolutions being performed. The first is in the way each source produces its signal: as a linear function of its recent history plus a white driving term, which could be expressed as a moving average model, a convolution with a white driving term, s_j = b_j * r_j. The second is in the way the sources are mixed: linear functions of the output of each source are added, x_i = Σ_j a_ij * s_j = Σ_j (a_ij * b_j) * r_j. Thus, with AR sources, the source convolution could be folded into the convolutions of the linear mixing process. \n\nIf we were to estimate values for the free parameters of this model, i.e. to estimate the filters, then the task of recovering the estimated sources from the sensor output would require inverting the linear A = (a_ij), as well as some technique to guarantee its non-singularity. Such a model is shown in figure 1a. Instead, we parameterize the model by W = A^{-1}, an estimated unmixing matrix, as shown in figure 1b. In this indirect representation, s is an explicit linear function of x, and therefore x is only an implicit linear function of s. This parameterization of the model is equally convenient for assigning probabilities to samples x, and is therefore suitable for MLE. Its advantage is that because the transformation from sensors to sources is estimated explicitly, the sources can be recovered directly from the data and the estimated model, without inversion. Note that in this inverse parameterization, the estimated mixture process is stored in inverse form. The source-specific models f_i are kept in forward form. Each source-specific model i has a vector of parameters, which we denote w^(i). \n\nWe are now in a position to calculate the likelihood of the data. For simplicity we use a matrix W of real numbers rather than FIR filters. 
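In the degenerate case of a real matrix W with iid sources and a fixed logistic source density (the case the framework reduces to, cf. sections 2.2 and 4), the likelihood and its natural-gradient ascent step can be sketched as follows. This is our own illustrative code under those assumptions, not the authors' implementation:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def log_likelihood(W, x):
    """G = log|det W| + sum_j log f_j(y_j) for y = Wx, f_j the logistic density."""
    y = W @ x
    log_f = -y - 2.0 * np.log1p(np.exp(-y))   # log[e^{-u} / (1 + e^{-u})^2]
    return np.log(np.abs(np.linalg.det(W))) + log_f.sum()

def natural_gradient_step(W, x, lr=0.01):
    """One stochastic ascent step on G, post-multiplied by W^T W so no matrix
    inversion is needed: W <- W + lr (I - phi(y) y^T) W."""
    y = W @ x
    phi = 2.0 * sigmoid(y) - 1.0              # -d log f / dy for the logistic density
    return W + lr * (np.eye(W.shape[0]) - np.outer(phi, y)) @ W

# Toy run: an instantaneous mixture of two heavy-tailed (Laplacian) sources.
rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 5000))
X = np.array([[1.0, 0.6], [0.4, 1.0]]) @ S
W = np.eye(2)
for t in rng.permutation(X.shape[1]):
    W = natural_gradient_step(W, X[:, t])
# After training, W times the mixing matrix should be roughly a scaled permutation.
```

Note that this fixed logistic density handles heavy-tailed sources such as speech, but, as the paper argues, fails on low-kurtosis and Gaussian-histogram sources; the context-sensitive source models are what lift that restriction.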
Generalizing this derivation to a matrix of filters is straightforward, following the same techniques used by Lambert (1996), Torkkola (1996), and A. Bell (1997), but space precludes a derivation here. \n\nThe individual generative source models give \n\np(y(t)|y(t − 1), y(t − 2), ...) = Π_i f_i(y_i(t)|y_i(t − 1), y_i(t − 2), ...) (1) \n\nwhere the probability densities f_i are each parameterized by vectors w^(i). Using these equations, we would like to express the likelihood of x(t) in closed form, given the history x(t − 1), x(t − 2), .... Since the history is known, we therefore also know the history of the recovered sources, y(t − 1), y(t − 2), .... This means that we can calculate the density p(y(t)|y(t − 1), ...). Using this, we can express the density of x(t) and expand G = log p(x; w) = log |W| + Σ_j log f_j(y_j(t)|y_j(t − 1), y_j(t − 2), ...; w^(j)). There are two sorts of parameters with respect to which we must take the derivative: the matrix W and the source parameters w^(j). The source parameters do not influence our recovered sources, and therefore have a simple form \n\ndG/dw^(j) = (df_j(y_j; w^(j))/dw^(j)) / f_j(y_j; w^(j)) \n\nHowever, a change to the matrix W changes y, which introduces a few extra terms. Note that d log |W|/dW = W^{−T}, the transpose inverse. Next, since y = Wx, we see that dy_j/dW = (0|x|0)^T, a matrix of zeros except for the vector x in row j. Now we note that the df_j(·)/dW term has two logical components: the first from the effect of changing W upon y_j(t), and the second from the effect of changing W upon y_j(t − 1), y_j(t − 2), .... (This second is called the \"recurrent term,\" and such terms are frequently dropped for convenience. As shown in figure 3, dropping this term here is not a reasonable approximation.) \n\ndf_j(y_j(t)|y_j(t − 1), ...; w^(j))/dW = (∂f_j/∂y_j(t)) dy_j(t)/dW + Σ_τ (∂f_j/∂y_j(t − τ)) dy_j(t − τ)/dW \n\nNote that the expression dy_j(t − τ)/dW is the only matrix, and it is zero except for the jth row, which is x(t − τ). The expression ∂f_j/∂y_j(t) we shall denote f'_j(·), and the expression ∂f_j/∂y_j(t − τ) we shall denote f_j^{(τ)}(·). We then have \n\n−dG/dW = −W^{−T} − (f'_j(·)/f_j(·))_j x(t)^T − Σ_{τ=1}^∞ (f_j^{(τ)}(·)/f_j(·))_j x(t − τ)^T (2) \n\nwhere (expr(j))_j denotes the column vector whose elements are expr(1), ..., expr(n). \n\n2.2 The natural gradient \n\nFollowing Amari, Cichocki, and Yang (1996), we follow a pseudogradient. Instead of using equation 2 directly, we post-multiply this quantity by W^T W. Since this is a positive-definite matrix, it does not affect the stochastic gradient convergence criteria, and the resulting quantity simplifies in a fashion that neatly eliminates the costly matrix inversion otherwise required. Convergence is also accelerated. \n\n3 Experiments \n\nWe conducted a number of experiments to test the efficacy of the cICA algorithm. The first, shown in figure 2, was a toy problem involving a set of processes deliberately constructed to be difficult for conventional source separation algorithms. In the second experiment, shown in figure 3, ten real sources were digitally mixed with an instantaneous matrix and separation performance was measured as a function of varying model complexity parameters. These sources are available for benchmarking purposes at http://www.cs.unm.edu/~bap/demos.html. \n\nFigure 2: cICA using a history of one time step and a mixture of five logistic densities for each source was applied to 5,000 samples of a mixture of two one-dimensional uniform distributions, each filtered by convolution with a decaying exponential with a time constant of 99.5. 
Shown is a scatterplot of the data input to the algorithm, along with the true source axes (left), the estimated residual probability density (center), and a scatterplot of the residuals of the data transformed into the estimated source space coordinates (right). The product of the true mixing matrix and the estimated unmixing matrix deviates from a scaling and permutation matrix by about 3%. \n\n[Figure 3 shows three panels, titled \"Truncated Gradient,\" \"Full Gradient,\" and \"Noise Model\"; the first two plot performance against the number of AR filter taps (0 to 20), the third against the number of logistics.] \n\nFigure 3: The performance of cICA as a function of model complexity and gradient accuracy. In all simulations, ten five-second clips taken digitally from ten audio CDs were digitally mixed through a random ten-by-ten instantaneous mixing matrix. The signal to noise ratio of each original source as expressed in the recovered sources is plotted. In (a) and (b), AR source models with a logistic noise term were used, and the number of taps of the AR model was varied. (This reduces to Bell-Sejnowski infomax when the number of taps is zero.) In (a), the recurrent term of the gradient was left out, while in (b) the recurrent term was included. Clearly the recurrent term is important. In (c), a degenerate AR model with zero taps was used, but the noise term was a mixture of logistics, and the number of logistics was varied. \n\n4 Discussion \n\nThe Infomax algorithm (Baram and Roth 1994) used for source separation (Bell and Sejnowski 1995) is a special case of the above algorithm in which (a) the mixing is not convolutional, so W(1) = W(2) = ... = 0, and (b) the sources are assumed to be iid, and therefore the distributions f_i(y(t)) are not history sensitive. Further, the form of the f_i is restricted to a very special distribution: the logistic density, the derivative of the sigmoidal function 1/(1 + exp(−ξ)). Although ICA has enjoyed a variety of applications (Makeig et al. 1996; Bell and Sejnowski 1996b; Baram and Roth 1995; Bell and Sejnowski 1996a), there are a number of sources which it cannot separate. These include all sources with Gaussian histograms (e.g. colored Gaussian sources, or even speech run through the right sort of slight nonlinearity), and sources with low kurtosis. As shown in the experiments above, these are of more than theoretical interest. \n\nIf we simplify our model to use ordinary AR models for the sources, with Gaussian noise terms of fixed variance, it is possible to derive a closed-form expression for W (Hagai Attias, personal communication). It may be that for many sources of practical interest, trading away this model accuracy for speed will be fruitful. \n\n4.1 Weakened assumptions \n\nIt seems clear that, in general, separating when there are fewer microphones than sources requires a strong Bayesian prior, and even given perfect knowledge of the mixture process and perfect source models, inverting the mixing process will be computationally burdensome. However, when there are more microphones than sources, there is an opportunity to improve the performance of the system in the presence of noise. This seems straightforward to integrate into our framework. Similarly, fast-timescale microphone nonlinearities are easily incorporated into this maximum likelihood approach. \n\nThe structure of this problem would seem to lend itself to EM. Certainly the individual source models can be easily optimized using EM, assuming that they themselves are of suitable form. \n\nReferences \n\nA. Bell, T.-W. L. (1997). Blind separation of delayed and convolved sources. In Advances in Neural Information Processing Systems 9. MIT Press. In this volume. \n\nAmari, S., Cichocki, A., and Yang, H. H. (1996). 
A new learning algorithm for blind signal separation. In Advances in Neural Information Processing Systems 8. MIT Press. \n\nBaram, Y. and Roth, Z. (1994). Density shaping by neural networks with application to classification, estimation and forecasting. Tech. rep. CIS-94-20, Center for Intelligent Systems, Technion, Israel Institute for Technology, Haifa. \n\nBaram, Y. and Roth, Z. (1995). Forecasting by density shaping using neural networks. In Computational Intelligence for Financial Engineering, New York City. IEEE Press. \n\nBell, A. J. and Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129-1159. \n\nBell, A. J. and Sejnowski, T. J. (1996a). The independent components of natural scenes. Vision Research. Submitted. \n\nBell, A. J. and Sejnowski, T. J. (1996b). Learning the higher-order structure of a natural sound. Network: Computation in Neural Systems. In press. \n\nBelouchrani, A. and Cardoso, J.-F. (1995). Maximum likelihood source separation by the expectation-maximization technique: Deterministic and stochastic implementation. In Proceedings of 1995 International Symposium on Non-Linear Theory and Applications, pp. 49-53, Las Vegas, NV. In press. \n\nLambert, R. H. (1996). Multichannel Blind Deconvolution: FIR Matrix Algebra and Separation of Multipath Mixtures. Ph.D. thesis, USC. \n\nMakeig, S., Anllo-Vento, L., Jung, T.-P., Bell, A. J., Sejnowski, T. J., and Hillyard, S. A. (1996). Independent component analysis of event-related potentials during selective attention. Society for Neuroscience Abstracts, 22. \n\nPearlmutter, B. A. and Parra, L. C. (1996). A context-sensitive generalization of ICA. In International Conference on Neural Information Processing, Hong Kong. Springer-Verlag. URL ftp://ftp.cnl.salk.edu/pub/bap/iconip-96-cica.ps.gz. \n\nPham, D., Garrat, P., and Jutten, C. (1992). Separation of a mixture of independent sources through a maximum likelihood approach. In European Signal Processing Conference, pp. 771-774. \n\nTorkkola, K. (1996). Blind separation of convolved sources based on information maximization. In Neural Networks for Signal Processing VI, Kyoto, Japan. IEEE Press. In press. \n\nA Fixed mixture AR models \n\nThe f_j(u_j; w_j) we used were mixture AR processes driven by logistic noise terms, as in Pearlmutter and Parra (1996). Each source model was \n\nf_j(u_j(t)|u_j(t − 1), u_j(t − 2), ...; w_j) = Σ_k m_jk h((u_j(t) − ū_jk)/σ_jk)/σ_jk (3) \n\nwhere σ_jk is a scale parameter for logistic density k of source j and is an element of w_j, and the mixing coefficients m_jk are elements of w_j and are constrained by Σ_k m_jk = 1. The component means ū_jk are taken to be linear functions of the recent values of that source, \n\nū_jk = Σ_{τ=1} a_jk(τ) u_j(t − τ) + b_jk (4) \n\nwhere the linear prediction coefficients a_jk(τ) and bias b_jk are elements of w_j. The derivatives of these are straightforward; see Pearlmutter and Parra (1996) for details. One complication is to note that, after each weight update, the mixing coefficients must be normalized, m_jk ← m_jk / Σ_{k'} m_jk'. \n", "award": [], "sourceid": 1179, "authors": [{"given_name": "Barak", "family_name": "Pearlmutter", "institution": null}, {"given_name": "Lucas", "family_name": "Parra", "institution": null}]}