{"title": "Estimating the Reliability of ICA Projections", "book": "Advances in Neural Information Processing Systems", "page_first": 1181, "page_last": 1188, "abstract": null, "full_text": "Estimating the Reliability of ICA Projections \n\nF. Meinecke 1,2, A. Ziehe 1, M. Kawanabe 1 and K.-R. M\u00fcller 1,2* \n\n1 Fraunhofer FIRST.IDA, Kekul\u00e9str. 7, 12489 Berlin, Germany \n\n2 University of Potsdam, Am Neuen Palais 10, 14469 Potsdam, Germany \n\n{meinecke,ziehe,nabe,klaus}@first.fhg.de \n\nAbstract \n\nWhen applying unsupervised learning techniques like ICA or temporal decorrelation, a key question is whether the discovered projections are reliable. In other words: can we give error bars, or can we assess the quality of our separation? We use resampling methods to tackle these questions and show experimentally that our proposed variance estimates are strongly correlated with the separation error. We demonstrate that this reliability estimation can be used to choose the appropriate ICA model, to significantly enhance the separation performance, and, most importantly, to mark the components that have an actual physical meaning. Application to 49-channel data from a magnetoencephalography (MEG) experiment underlines the usefulness of our approach. \n\n1 Introduction \n\nBlind source separation (BSS) techniques have found widespread use in various application domains, e.g. acoustics, telecommunication or biomedical signal processing (see e.g. [9, 5, 6, 1, 2, 4, 14, 8]). \nBSS is a statistical technique to reveal unknown source signals when only mixtures of them can be observed. In the following we will only consider linear mixtures; the goal is then to estimate the projection directions that recover the source signals. 
\nMany different BSS algorithms have been proposed, but to our knowledge no principled attempts have been made so far to assess the reliability of BSS algorithms, such that error bars are given along with the resulting projection estimates. This lack of error bars or of means for selecting between competing models is of course a basic dilemma for most unsupervised learning algorithms. The sources of potential unreliability of unsupervised algorithms are ubiquitous, e.g. noise, non-stationarities, small sample size or inadequate modeling (e.g. sources are simply dependent instead of independent). Unsupervised projection techniques like PCA or BSS will always give an answer that is found within their model class, e.g. PCA will supply an orthogonal basis even if the correct modeling might be non-orthogonal. But how can we assess such a misspecification or a large statistical error? \nOur approach to this problem is inspired by the large body of statistics literature on resampling methods (see [12] or [7] for references), where algorithms for assessing the stability of the solution have been analyzed, e.g. for PCA [3]. \nWe propose reliability estimates based on bootstrap resampling. This will enable us to select a good BSS model, in order to improve the separation performance and to find potentially meaningful projection directions. In the following we will give an algorithmic description of the resampling methods, accompanied by some theoretical remarks (section 2), and show excellent experimental results (sections 3 and 4). We conclude with a brief discussion. \n\n* To whom correspondence should be addressed. 
\n\n2 Resampling Techniques for BSS \n\n2.1 The ICA Model \n\nIn blind source separation we assume that at time instant $t$ each component $x_i(t)$ of the observed $n$-dimensional data vector $x(t)$ is a linear superposition of $m \\leq n$ statistically independent signals: \n\n$$x_i(t) = \\sum_{j=1}^{m} A_{ij} s_j(t)$$ \n\n(e.g. [8]). The source signals $s_j(t)$ are unknown, as are the coefficients $A_{ij}$ of the mixing matrix $A$. The goal is therefore to estimate both unknowns from a sample of the $x(t)$, i.e. $y(t) = \\hat{s}(t) = W x(t)$, where $W$ is called the separating matrix. Since both $A$ and $s(t)$ are unknown, it is impossible to recover the scaling or the order of the columns of the mixing matrix $A$. All that one can get are the projection directions. The mixing/demixing process can be described as a change of coordinates. From this point of view the data vector stays the same, but is expressed in different coordinate systems (passive transformation). Let $\\{e_i\\}$ be the canonical basis of the true sources, $s = \\sum_i e_i s_i$. Analogously, let $\\{f_j\\}$ be the basis of the estimated ICA channels: $y = \\sum_j f_j y_j$. Using this, we can define a component-wise separation error $E_i$ as the angle difference between the true direction of the source and the direction of the respective ICA channel: \n\n$$E_i = \\arccos\\left( \\frac{e_i \\cdot f_i}{\\|e_i\\| \\, \\|f_i\\|} \\right).$$ \n\nTo calculate this angle difference, remember that component-wise we have $y_j = \\sum_{k,i} W_{jk} A_{ki} s_i$. With $y = s$, this leads to $f_j = \\sum_i e_i (WA)^{-1}_{ij}$, i.e. $f_j$ is the $j$-th column of $(WA)^{-1}$. \nIn the following, we will illustrate our approach for two different source separation algorithms (JADE, TDSEP). JADE [4], which uses higher order statistics, is based on the joint diagonalization of matrices obtained from 'parallel slices' of the fourth order cumulant tensor. TDSEP [14] relies on second order statistics only, enforcing temporal decorrelation between channels. 
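
The separation error of Section 2.1 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the true mixing matrix A is known (as in a simulation), the sign and order of the estimated channels have already been matched to the sources, and the helper names mat_mul, mat_inv and separation_errors are hypothetical, not taken from the paper.

```python
import math

def mat_mul(a, b):
    # Plain-Python matrix product of two nested lists.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_inv(m):
    # Gauss-Jordan inversion with partial pivoting (meant for small matrices).
    n = len(m)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError('matrix is singular')
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def separation_errors(w, a):
    # f_i is the i-th column of (WA)^{-1}. Since e_i is the i-th canonical
    # basis vector (unit norm), e_i . f_i is the i-th entry of that column.
    # Assumption: sign and permutation of the channels are already matched.
    g_inv = mat_inv(mat_mul(w, a))
    n = len(g_inv)
    errors = []
    for i in range(n):
        f = [g_inv[r][i] for r in range(n)]
        norm = math.sqrt(sum(v * v for v in f))
        cos_angle = max(-1.0, min(1.0, f[i] / norm))
        errors.append(math.acos(cos_angle))
    return errors
```

If W is the exact inverse of A, every error is (numerically) zero; if WA is a plane rotation by a small angle, the two affected channels both report that angle.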
\n\n2.2 About Resampling \n\nThe objective of resampling techniques is to produce surrogate data sets that allow us to approximate the 'separation error' by a repeated estimation of the parameters of interest. The underlying mixing should of course be independent of the generation process of the surrogate data and therefore remain invariant under resampling. \n\nBootstrap Resampling \n\nThe most popular resampling methods are the Jackknife and the Bootstrap (see e.g. [12, 7]). The Jackknife produces surrogate data sets by just deleting one datum each time from the original data. There are generalizations of this approach, like k-fold cross-validation, which delete more than one datum at a time. A more general approach is the Bootstrap. Consider a block of, say, $N$ data points. To obtain one bootstrap sample, we randomly draw $N$ elements from the original data, i.e. some data points might occur several times, while others do not occur at all in the bootstrap sample. This defines a series $\\{a_t\\}$, with each $a_t$ telling how often the data point $x(t)$ has been drawn. Then, the separating matrix is computed on the full block and repeatedly on each of the $N$-element bootstrap samples. The variance is computed as the average squared difference between the estimate on the full block and the respective bootstrap unmixings. (These resampling methods have some desirable properties, which make them very attractive; for example, it can be shown that for iid data the bootstrap estimators of the distributions of many commonly used statistics are consistent.) It is straightforward to apply this procedure to BSS algorithms that do not use time structure; however, only a small modification is needed to take time structure into account. For example, the time-lagged correlation matrices needed for TDSEP can be obtained from $\\{a_t\\}$ by \n\n$$C_{ij}(\\tau) = \\frac{1}{N} \\sum_{t=1}^{N} a_t \\cdot x_i(t) x_j(t+\\tau)$$ \n\nwith $\\sum_t a_t = N$ and $a_t \\in \\{0, 1, 2, \\ldots\\}$. 
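
The bootstrap weighting described above can be illustrated with a short Python sketch. It draws the counts a_t and evaluates the weighted time-lagged correlation matrix. The function names are hypothetical, and one detail is an assumption not fixed by the text: terms whose shifted index t + tau falls beyond the end of the block are simply dropped.

```python
import random

def bootstrap_counts(n, rng):
    # Draw n indices with replacement and count how often each t occurs,
    # so that sum(a) == n and every a[t] is a non-negative integer.
    a = [0] * n
    for _ in range(n):
        a[rng.randrange(n)] += 1
    return a

def weighted_lagged_corr(x, a, tau):
    # x: list of channels, each a list of N samples; a: bootstrap counts.
    # Returns C[i][j] = (1/N) * sum_t a[t] * x[i][t] * x[j][t + tau].
    # Assumption: terms with t + tau >= N are dropped.
    n = len(a)
    dim = len(x)
    c = [[0.0] * dim for _ in range(dim)]
    for t in range(n - tau):
        if a[t] == 0:
            continue
        for i in range(dim):
            for j in range(dim):
                c[i][j] += a[t] * x[i][t] * x[j][t + tau] / n
    return c
```

With all counts equal to one, the function reduces to the ordinary time-lagged correlation estimate, which makes it easy to check against an existing implementation.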
\n\nOther resampling methods \n\nBesides the Bootstrap, there are other resampling methods like the Jackknife or cross-validation, which can be understood as special cases of the Bootstrap. We have tried k-fold cross-validation, which yielded very similar results to the ones reported here. \n\n2.3 The Resampling Algorithm \n\nAfter performing BSS, the estimated ICA-projections are used to generate surrogate data by resampling. On the whitened\u00b9 surrogate data, the source separation algorithm is used again to estimate a rotation that separates this surrogate data. In order to compare different rotation matrices, we use the fact that the matrix representation of the rotation group SO(N) can be parameterized by \n\n$$R(\\alpha) = \\exp\\left( \\sum_{i < j} \\alpha_{ij} M_{ij} \\right) \\quad \\text{with} \\quad (M_{ab})_{ij} = \\delta_{ai}\\delta_{bj} - \\delta_{aj}\\delta_{bi},$$ \n\nwhere the matrices $M_{ij}$ are generators of the group and the $\\alpha_{ij}$ are the rotation parameters (angles) of the rotation matrix $R$. Using this parameterization we can easily compare different N-dimensional rotations by comparing the rotation parameters $\\alpha_{ij}$. Since the sources are already separated, the estimated rotation matrices will be in the vicinity of the identity matrix.\u00b2 \n\n\u00b9 The whitening transformation is defined as $x' = Vx$ with $V = E[xx^T]^{-1/2}$. \n\u00b2 It is important to perform the resampling when the sources are already separated, so that the $\\alpha_{ij}$ are distributed around zero, because SO(N) is a non-Abelian group; that means that in general $R(\\alpha)R(\\beta) \\neq R(\\beta)R(\\alpha)$. \n\n$\\mathrm{Var}(\\alpha_{ij})$ measures the instability of the separation with respect to a rotation in the $(i, j)$-plane. Since the reliability of a projection is bounded by the maximum angle variance of all rotations that affect this direction, we define the uncertainty of the $i$-th ICA-Projection as $U_i := \\max_j \\mathrm{Var}(\\alpha_{ij})$. Let us summarize the resampling algorithm: \n\n1. Estimate the separating matrix $W$ with some ICA algorithm. Calculate the ICA-Projections $y = Wx$ \n\n2. 
Produce $k$ surrogate data sets from $y$ and whiten these data sets \n3. For each surrogate data set: do BSS, producing a set of rotation matrices \n4. Calculate variances of rotation parameters (angles) $\\alpha_{ij}$ \n5. For each ICA component calculate the uncertainty $U_i = \\max_j \\mathrm{Var}(\\alpha_{ij})$. \n\n2.4 Asymptotic Considerations for Resampling \n\nProperties of resampling methods are typically studied in the limit when the number of bootstrap samples $B \\to \\infty$ and the length of the signal $T \\to \\infty$ [12]. In our case, as $B \\to \\infty$, the bootstrap variance estimator $\\hat{U}^*_i(B)$ computed from the $\\alpha^*_{ij}$'s converges to $\\hat{U}^*_i(\\infty) := \\max_j \\mathrm{Var}_{\\hat{F}}[\\alpha^*_{ij}]$, where $\\alpha^*_{ij}$ denotes the resampled deviation and $\\hat{F}$ denotes the distribution generating it. Furthermore, if $\\hat{F} \\to F$, $\\hat{U}^*_i(\\infty)$ converges to the true variance $U_i = \\max_j \\mathrm{Var}_F[\\alpha_{ij}]$ as $T \\to \\infty$. This is the case, for example, if the original signal is i.i.d. in time. When the data has time structure, $\\hat{F}$ does not necessarily converge to the generating distribution $F$ of the original signal anymore. Although we cannot neglect this difference completely, it is small enough to use our scheme for the purposes considered in this paper. For example, in TDSEP, where the $\\alpha_{ij}$ depend on the variation of the time-lagged covariances $C_{ij}(\\tau)$ of the signals, we can show that their estimators $C^*_{ij}(\\tau)$ are unbiased. Furthermore, the difference $\\Delta_{ijkl}(\\tau, \\nu) = \\mathrm{Cov}_F[C_{ij}(\\tau), C_{kl}(\\nu)] - \\mathrm{Cov}_{\\hat{F}}[C^*_{ij}(\\tau), C^*_{kl}(\\nu)]$ between the covariance of the real matrices and their bootstrap estimators can be bounded if $\\exists a < 1, M \\geq 1, \\forall i: |C_{ii}(\\tau)| \\leq M a^{|\\tau|} |C_{ii}(0)|$. In our experiments, however, the bias is usually found to be much smaller than this upper bound. \n\n3 Experiments \n\n3.1 Comparing the separation error with the uncertainty estimate \n\nTo show the practical applicability of the resampling idea to ICA, the separation error $E_i$ was compared with the uncertainty $U_i$. 
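
For readers who want to see how the uncertainty U_i of Section 2.3 might be computed from a set of bootstrap rotation matrices, here is a rough Python sketch. It rests on two simplifications that are assumptions of this example, not statements from the paper: the rotation parameters are read off with the first-order approximation alpha_ij = (R_ij - R_ji) / 2, which is only reasonable close to the identity, and the variance is taken around zero (the identity rotation). All function names are hypothetical.

```python
import math

def plane_rotation(n, i, j, angle):
    # Rotation exp(angle * M_ij) with (M_ij)[i][j] = 1 and (M_ij)[j][i] = -1.
    r = [[float(p == q) for q in range(n)] for p in range(n)]
    r[i][i] = r[j][j] = math.cos(angle)
    r[i][j] = math.sin(angle)
    r[j][i] = -math.sin(angle)
    return r

def small_angles(r):
    # First-order estimate of the rotation parameters near the identity:
    # alpha[i][j] is approximately (R[i][j] - R[j][i]) / 2.
    n = len(r)
    return [[0.5 * (r[i][j] - r[j][i]) for j in range(n)] for i in range(n)]

def uncertainties(rotations):
    # rotations: list of N x N rotation matrices, one per surrogate data set.
    # Assumption: the variance is measured around zero (identity rotation).
    n = len(rotations[0])
    k = len(rotations)
    var = [[0.0] * n for _ in range(n)]
    for r in rotations:
        alpha = small_angles(r)
        for i in range(n):
            for j in range(n):
                var[i][j] += alpha[i][j] ** 2 / k
    # U_i is the largest angle variance over all planes containing axis i.
    return [max(var[i][j] for j in range(n) if j != i) for i in range(n)]
```

In this sketch a channel that is never rotated into any other channel across the surrogate data sets gets an uncertainty of exactly zero, while two channels that are mixed by the bootstrap rotations share the same nonzero value.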
The separation was performed on different artificial 2D mixtures of speech and music signals and on different iid data sets of the same variance. To achieve different separation qualities, white Gaussian noise of different intensity was added to the mixtures. \n\n[Figure 1, two panels: (a) distribution of the separation error $E_j$ for uncertainties $U_j = 0.015$ and $U_j = 0.177$; (b) separation error plotted against the uncertainty.] \n\nFigure 1: (a) For a small uncertainty, the probability distribution of the separation error is concentrated close to zero; for higher uncertainty it spreads over a larger range. (b) The expected error increases with the uncertainty. \n\nFigure 1 relates the uncertainty to the separation error for JADE (TDSEP results look qualitatively the same). In Fig. 1 (left) we see the separation error distribution, which has a strong peak for small values of our uncertainty measure, whereas for large uncertainties it tends to become flat, i.e. (as also seen from Fig. 1 (right)) the uncertainty reflects the true separation error very well. \n\n3.2 Selecting the appropriate BSS algorithm \n\nAs our variance estimation gives a high correlation to the (true) separation error, the next logical step is to use it as a model selection criterion for: (a) selecting some hyperparameter of the BSS algorithm, e.g. choosing the lag values for TDSEP, or (b) choosing between a set of different algorithms that rely on different assumptions about the data, i.e. higher order statistics (e.g. JADE, INFOMAX, FastICA, ...) or second order statistics (e.g. TDSEP). It could, in principle, be much better to extract the first component with one and the next with another assumption/algorithm. 
To illustrate the usefulness of our reliability measure, we study a five-channel mixture of two channels of pure white Gaussian noise, two audio signals and one channel of uniformly distributed noise. The reliability analysis for JADE gives the advice to rely only on channels 3, 4, 5 (cf. Fig. 2 left). In fact, these are the channels that contain the audio signals and the uniformly distributed noise. The same analysis applied to the TDSEP-projections (time lag = 0, ..., 20) shows that TDSEP can give reliable estimates only for the two audio sources (which is to be expected; cf. Fig. 2 right). According to our measure, the estimation for the audio sources is more reliable in the TDSEP case. Calculation of the separation error verifies this: TDSEP separates better by about 3 orders of magnitude (JADE: $E_3 = 1.5 \\cdot 10^{-1}$, $E_4 = 1.4 \\cdot 10^{-1}$; TDSEP: $E_3 = 1.2 \\cdot 10^{-4}$, $E_4 = 8.7 \\cdot 10^{-5}$). Finally, in our example, estimating the audio sources with TDSEP and then applying JADE to the orthogonal subspace gives the optimal solution, since it combines the small separation errors $E_3$, $E_4$ for TDSEP with the ability of JADE to separate the uniformly distributed noise. \n\n[Figure 2, two panels: uncertainty versus ICA channel $i$ for higher order statistics (JADE, left) and temporal decorrelation (TDSEP, right); the TDSEP panel is annotated with TDSEP 3: $9.17 \\cdot 10^{-5}$ and TDSEP 4: $1.29 \\cdot 10^{-5}$.] \n\nFigure 2: Uncertainty of ICA projections of an artificial mixture using JADE and TDSEP. Resampling displays the strengths and weaknesses of the different models. \n\n3.3 Blockwise uncertainty estimates \n\nFor a longer time series it is not only important to know which ICA channels are reliable, but also to know whether different parts of a given time series are more (or less) reliable to separate than others. 
To demonstrate these effects, we mixed two audio sources (8 kHz, 10 s, i.e. 80000 data points), where the mixtures are partly corrupted by white Gaussian noise. Reliability analysis is performed on windows of length 1000, shifted in steps of 250; the resulting variance estimates are smoothed. Fig. 3 shows again that the uncertainty measure is nicely correlated with the true separation error; furthermore, the variance goes up systematically within the noisy part, but also in other parts of the time series that do not seem to match the assumptions underlying the algorithm.\u00b3 So our reliability estimates can eventually be used to improve separation performance by removing all but the 'reliable' parts of the time series. For our example this reduces the overall separation error by 2 orders of magnitude, from $2.4 \\cdot 10^{-2}$ to $1.7 \\cdot 10^{-4}$. \n\nFigure 3: Upper panel: mixtures, partly corrupted by noise. Lower panel: the blockwise variance estimate (solid line) vs the true separation error on this block (dotted line). \n\nThis moving-window resampling can detect instabilities of the projections in two different ways: besides the resampling variance that can be calculated for each window, one can also calculate the change of the projection directions between two windows. The latter has already been used successfully by Makeig et al. [10]. \n\n4 Assigning Meaning: Application to Biomedical Data \n\nWe now apply our reliability analysis to biomedical data that has been produced by an MEG experiment with acoustic stimulation. The stimulation was achieved by presenting alternating periods of music and silence, each of 30 s length, to the subject's right ear during 30 min of total recording time (for details see [13]). The measured DC magnetic field values, sampled at a frequency of 0.4 Hz, gave a total number of 720 sample points for each of the 49 channels. 
\n\n\u00b3 For example, the peak in the last third of the time series can be traced back to the fact that the original time series are correlated in this region. \n\nWhile previously analysing the data [13], we found that many of the ICA components are seemingly meaningless, and it took some medical knowledge to find potentially meaningful projections for a later close inspection. However, our reliability assessment can also be seen as an indication for meaningful projections, i.e. meaningful components should have low variance. In the experiment, BSS was performed on the 23 most powerful principal components using (a) higher order statistics (JADE) and (b) temporal decorrelation (TDSEP, time lag 0..50). The results in Fig. 4 show that none of the JADE-projections (left) have small variance, whereas TDSEP (right) identifies three sources with a good reliability. In fact, these three components have physical meaning: while component 23 is an internal very low frequency signal (drift) that is always present in DC-measurements, component 22 turns out to be an artifact of the measurement; interestingly, component 6 shows a (noisy) rectangular waveform that clearly displays the 1/30 s on/off characteristics of the stimulus (correlation to stimulus 0.7; see Fig. 5). \n\n[Figure 4, two panels: uncertainty versus ICA channel $i$ for higher order statistics (JADE, left) and temporal decorrelation (TDSEP, right).] \n\nFigure 4: Resampling on the biomedical data from the MEG experiment shows: (a) no JADE projection is reliable (has low uncertainty); (b) TDSEP is able to identify three sources with low uncertainty. \n\n
The clear dipole structure of the spatial field pattern in Fig. 5 underlines the relevance of this projection. The components found by JADE do not show such a clear structure, and the strongest correlation of any component to the stimulus is about 0.3, which is of the same order of magnitude as the strongest correlated PCA-component before applying JADE. \n\n[Figure 5: time course panel shows the stimulus, with time axis $t$ in min.] \n\nFigure 5: Spatial field pattern, frequency content and time course of TDSEP channel 6. \n\n5 Discussion \n\nWe proposed a simple method to estimate the reliability of ICA projections based on resampling techniques. After showing that our technique approximates the separation error, several directions are open for applications. First, we may use it for model selection purposes, to distinguish between algorithms or to choose appropriate hyperparameter values (possibly even component-wise). Second, variances can be estimated on blocks of data, and separation performance can be enhanced by using only low variance blocks where the model matches the data nicely. Finally, reliability estimates can be used to find meaningful components. Here our assumption is that the more meaningful a component is, the more stably we should be able to estimate it. In this sense artifacts of course also appear as meaningful, whereas noisy directions are discarded easily due to their high uncertainty. \nFuture research will focus on applying resampling techniques to other unsupervised learning scenarios. We will also consider Bayesian modeling, where often a variance estimate comes for free along with the trained model. \n\nAcknowledgments K.-R. M. thanks Guido Nolte and the members of the Oberwolfach Seminar September 2000, in particular Lutz D\u00fcmbgen and Enno Mammen, for helpful discussions and suggestions. K.-R. M. and A. Z. 
acknowledge partial funding by the EU project (IST-1999-14190 - BLISS). We thank the Biomagnetism Group of the Physikalisch-Technische Bundesanstalt (PTB) for providing the MEG-DC data. \n\nReferences \n\n[1] S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems (NIPS 95), volume 8, pages 882-893. The MIT Press, 1996. \n\n[2] A. J. Bell and T. J. Sejnowski. An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159, 1995. \n\n[3] R. Beran and M. S. Srivastava. Bootstrap tests and confidence regions for functions of a covariance matrix. Annals of Statistics, 13:95-115, 1985. \n\n[4] J.-F. Cardoso and A. Souloumiac. Blind beamforming for non Gaussian signals. IEEE Proceedings-F, 140(6):362-370, December 1994. \n\n[5] P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287-314, 1994. \n\n[6] G. Deco and D. Obradovic. An information-theoretic approach to neural computing. Springer, New York, 1996. \n\n[7] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, first edition, 1993. \n\n[8] A. Hyv\u00e4rinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, 2001. \n\n[9] Ch. Jutten and J. Herault. Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24:1-10, 1991. \n\n[10] S. Makeig, S. Enghoff, T.-P. Jung, and T. Sejnowski. Moving-window ICA decomposition of EEG data reveals event-related changes in oscillatory brain activity. In Proc. 2nd Int. Workshop on Independent Component Analysis and Blind Source Separation (ICA '2000), pages 627-632, Helsinki, Finland, 2000. \n\n[11] F. Meinecke, A. Ziehe, M. Kawanabe, and K.-R. M\u00fcller. 
Assessing reliability of ICA projections - a resampling approach. In ICA '01. T.-W. Lee, Ed., 2001. \n\n[12] J. Shao and D. Tu. The Jackknife and Bootstrap. Springer, New York, 1995. \n\n[13] G. W\u00fcbbeler, A. Ziehe, B.-M. Mackert, K.-R. M\u00fcller, L. Trahms, and G. Curio. Independent component analysis of non-invasively recorded cortical magnetic dc-fields in humans. IEEE Transactions on Biomedical Engineering, 47(5):594-599, 2000. \n\n[14] A. Ziehe and K.-R. M\u00fcller. TDSEP - an efficient algorithm for blind separation using time structure. In L. Niklasson, M. Boden, and T. Ziemke, editors, Proc. Int. Conf. on Artificial Neural Networks (ICANN'98), pages 675-680, Sk\u00f6vde, Sweden, 1998. Springer Verlag. \n", "award": [], "sourceid": 2086, "authors": [{"given_name": "Frank", "family_name": "Meinecke", "institution": null}, {"given_name": "Andreas", "family_name": "Ziehe", "institution": null}, {"given_name": "Motoaki", "family_name": "Kawanabe", "institution": null}, {"given_name": "Klaus-Robert", "family_name": "M\u00fcller", "institution": null}]}