{"title": "Ensemble Learning and Linear Response Theory for ICA", "book": "Advances in Neural Information Processing Systems", "page_first": 542, "page_last": 548, "abstract": null, "full_text": "Ensemble Learning and Linear Response Theory for ICA \n\nPedro A.d.F.R. Højen-Sørensen^1, Ole Winther^2, Lars Kai Hansen^1 \n\n^1 Department of Mathematical Modelling, Technical University of Denmark B321, DK-2800 Lyngby, Denmark, phs,lkhansen@imm.dtu.dk \n\n^2 Theoretical Physics, Lund University, Sölvegatan 14 A, S-223 62 Lund, Sweden, winther@nimis.thep.lu.se \n\nAbstract \n\nWe propose a general Bayesian framework for performing independent component analysis (ICA) which relies on ensemble learning and linear response theory known from statistical physics. We apply it to both discrete and continuous sources. For the continuous source the underdetermined (overcomplete) case is studied. The naive mean-field approach fails in this case, whereas linear response theory, which gives an improved estimate of covariances, is very efficient. The examples given are for sources without temporal correlations; however, the derivation can easily be extended to treat temporal correlations. Finally, the framework offers a simple way of generating new ICA algorithms without needing to define the prior distribution of the sources explicitly. \n\n1 Introduction \n\nReconstruction of statistically independent source signals from linear mixtures is an active research field. For historical background and early references see e.g. [1]. The source separation problem has a Bayesian formulation, see e.g. [2, 3], for which there has been some recent progress based on ensemble learning [4]. \n\nIn the Bayesian framework, the covariances of the sources are needed in order to estimate the mixing matrix and the noise level. 
Unfortunately, ensemble learning using factorized trial distributions only treats self-interactions correctly and trivially predicts ⟨S_i S_j⟩ - ⟨S_i⟩⟨S_j⟩ = 0 for i ≠ j. This naive mean-field (NMF) approximation, first introduced in the neural computing context by Ref. [5] for Boltzmann machine learning, may completely fail in some cases [6]. Recently, Kappen and Rodriguez [6] introduced an efficient learning algorithm for Boltzmann machines based on linear response (LR) theory. LR theory gives a recipe for computing an improved approximation to the covariances directly from the solution to the NMF equations [7]. \n\nEnsemble learning has been applied in many contexts within neural computation, e.g. for sigmoid belief networks [8], where advanced mean-field methods such as LR theory or TAP [9] may also be applicable. In this paper, we show how LR theory can be applied to independent component analysis (ICA). The performance of this approach is compared to that of the NMF approach. We observe that NMF may fail for high noise levels and binary sources, and for the underdetermined continuous case. In these cases the NMF approach ignores one of the sources and consequently overestimates the noise. The LR approach, on the other hand, succeeds in all cases studied. \n\nThe derivation of the mean-field equations is kept completely general and is thus valid for a general source prior (without temporal correlations). The final equations show that the mean-field framework may be used to propose ICA algorithms for which the source prior is only defined implicitly. \n\n2 Probabilistic ICA \n\nFollowing Ref. [10], we consider a collection of N temporal measurements, X = {Xdt}, where Xdt denotes the measurement at the dth sensor at time t. Similarly, let S = {Smt} denote a collection of M mutually independent sources, where Sm. is the mth source, which in general may have temporal correlations. 
The measured signals X are assumed to be an instantaneous linear mixing of the sources corrupted with additive Gaussian noise Γ, that is, \n\nX = AS + Γ , (1) \n\nwhere A is the mixing matrix. Furthermore, to simplify the exposition, the noise is assumed to be iid Gaussian with variance σ^2. The likelihood of the parameters is then given by \n\nP(X|A, σ^2) = ∫ dS P(X|A, σ^2, S) P(S) , (2) \n\nwhere P(S) is the prior on the sources, which might include temporal correlations. We will, however, throughout this paper assume that the sources are temporally uncorrelated. We choose to estimate the mixing matrix A and noise level σ^2 by Maximum Likelihood II (ML-II). The saddle point of P(X|A, σ^2) is attained at \n\n∂ log P(X|A, σ^2)/∂A = 0 : A = X⟨S⟩^T ⟨SS^T⟩^{-1} , (3) \n\n∂ log P(X|A, σ^2)/∂σ^2 = 0 : σ^2 = (1/DN) Tr ⟨(X - AS)^T (X - AS)⟩ , (4) \n\nwhere ⟨·⟩ denotes an average over the posterior and D is the number of sensors. \n\n3 Mean field theory \n\nFirst, we derive mean-field equations using ensemble learning. Secondly, using linear response theory, we obtain improved estimates of the off-diagonal terms of ⟨SS^T⟩, which are needed for estimating A and σ^2. The following derivation is performed for an arbitrary source prior. \n\n3.1 Ensemble learning \n\nWe adopt a standard ensemble learning approach and approximate \n\nP(S|X, A, σ^2) = P(X|A, σ^2, S) P(S) / P(X|A, σ^2) (5) \n\nby a member of the family of product distributions Q(S) = ∏_{mt} Q(Smt). It has been shown in Ref. [11] that for a Gaussian P(X|A, σ^2, S), the optimal choice of Q(Smt) is given by a Gaussian times the prior: \n\nQ(Smt) ∝ P(Smt) e^{(1/2) λ_mt Smt^2 + γ_mt Smt} . (6) \n\nIn the following, it is convenient to use standard physics notation to keep everything as general as possible. We therefore parameterize the Gaussian as \n\nP(X|A, σ^2, S) = P(X|J, h, S) = C e^{(1/2) Tr(S^T J S) + Tr(h^T S)} , (7) \n\nwhere J = -A^T A / σ^2 is the M x M interaction matrix and h = A^T X / σ^2 has the same dimensions as the source matrix S. 
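As a concrete illustration, the ML-II updates of eqs. (3)-(4) and the natural parameters of eq. (7) can be sketched in a few lines of numpy. This is our own minimal sketch, not the authors' code; the function name and the convention that the posterior moments are passed in precomputed are our assumptions.

```python
import numpy as np

def ml2_update(X, S_mean, SS_mean):
    """One ML-II update of A and sigma^2 from posterior source moments.

    X       : D x N data matrix
    S_mean  : M x N posterior means <S>
    SS_mean : M x M second moments <S S^T> (summed over time)
    Returns the updated A, sigma^2 and the natural parameters J, h of eq. (7).
    """
    D, N = X.shape
    # eq. (3): A = X <S>^T <S S^T>^-1
    A = X @ S_mean.T @ np.linalg.inv(SS_mean)
    # eq. (4): sigma^2 = Tr<(X - A S)^T (X - A S)> / (D N), expanded in moments
    resid = (np.trace(X.T @ X)
             - 2.0 * np.sum(X * (A @ S_mean))
             + np.trace(A.T @ A @ SS_mean))
    sigma2 = resid / (D * N)
    # eq. (7): interaction matrix and external field
    J = -A.T @ A / sigma2
    h = A.T @ X / sigma2
    return A, sigma2, J, h
```

In a full algorithm these updates would alternate with the mean-field equations of section 3.1, which supply the posterior moments ⟨S⟩ and ⟨SS^T⟩.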
Note that h acts as an external field from which we can obtain all moments of the sources. This is a property that we will make use of in the next section when we derive the linear response corrections. The Kullback-Leibler divergence between the optimal product distribution Q(S) and the true source posterior is given by \n\nKL = ∫ dS Q(S) ln[ Q(S) / P(S|X, A, σ^2) ] = ln P(X|A, σ^2) - ln P~(X|A, σ^2) , (8) \n\nln P~(X|A, σ^2) = Σ_mt log ∫ dS P(S) e^{(1/2) λ_mt S^2 + γ_mt S} + (1/2) Σ_mt (J_mm - λ_mt) ⟨S_mt^2⟩ + (1/2) Tr ⟨S⟩^T (J - diag(J)) ⟨S⟩ + Tr (h - γ)^T ⟨S⟩ + ln C , (9) \n\nwhere P~(X|A, σ^2) is the naive mean-field approximation to the likelihood and diag(J) is the diagonal matrix of J. The saddle points define the mean-field equations: \n\n∂KL/∂⟨S⟩ = 0 : γ = h + (J - diag(J)) ⟨S⟩ , (10) \n\n∂KL/∂⟨S_mt^2⟩ = 0 : λ_mt = J_mm . (11) \n\nThe remaining two equations depend explicitly on the source prior P(S): \n\n∂KL/∂γ_mt = 0 : ⟨S_mt⟩ = ∂/∂γ_mt log ∫ dS_mt P(S_mt) e^{(1/2) λ_mt S_mt^2 + γ_mt S_mt} ≡ f(γ_mt, λ_mt) , (12) \n\n∂KL/∂λ_mt = 0 : ⟨S_mt^2⟩ = 2 ∂/∂λ_mt log ∫ dS_mt P(S_mt) e^{(1/2) λ_mt S_mt^2 + γ_mt S_mt} . (13) \n\nIn section 4, we calculate f(γ_mt, λ_mt) for some of the prior distributions found in the ICA literature. \n\n3.2 Linear response theory \n\nAs mentioned already, h acts as an external field. This makes it possible to calculate the means and covariances as derivatives of log P(X|J, h), i.e. \n\n⟨S_mt⟩ = ∂ log P(X|J, h) / ∂h_mt , (14) \n\nχ^{tt'}_{mm'} ≡ ⟨S_mt S_m't'⟩ - ⟨S_mt⟩⟨S_m't'⟩ = ∂^2 log P(X|J, h) / ∂h_m't' ∂h_mt (15) \n\n= ∂⟨S_mt⟩ / ∂h_m't' . (16) \n\nTo derive an equation for χ^{tt'}_{mm'}, we use eqs. (10), (11) and (12) to get \n\nχ^{tt'}_{mm'} = ∂f(γ_mt, λ_mt)/∂h_m't' = ∂f(γ_mt, λ_mt)/∂γ_mt ( Σ_{m'' ≠ m} J_mm'' χ^{tt'}_{m''m'} + δ_mm' δ_tt' ) . 
Figure 1: Binary source recovery for low noise level (M = 2, D = 2). Shown from left to right: +/- the column vectors of the true A (with the observations superimposed); the estimated A (NMF); the estimated A (LR). \n\nFigure 2: Binary source recovery for low noise level (M = 2, D = 2). Shows the dynamics of the fixed-point iterations. From left to right: +/- the column vectors of A (NMF); +/- the column vectors of A (LR); variance σ^2 (solid: NMF, dashed: LR, thick dash-dotted: the true empirical noise variance). \n\nWe now see that the χ-matrix factorizes in time, χ^{tt'}_{mm'} = δ_tt' χ^t_{mm'}. This is a direct consequence of the fact that the model has no temporal correlations. The above equation is linear and may straightforwardly be solved to yield \n\nχ^t_{mm'} = [(Λ_t - J)^{-1}]_{mm'} , (17) \n\nwhere we have defined the diagonal matrix \n\nΛ_t = diag( (∂f(γ_1t, λ_1t)/∂γ_1t)^{-1} + J_11 , ... , (∂f(γ_Mt, λ_Mt)/∂γ_Mt)^{-1} + J_MM ) . \n\nAt this point it is appropriate to explain why linear response theory is more precise than using the factorized distribution, which predicts χ^t_{mm'} = 0 for non-diagonal terms. Here, we give an argument that can be found in Parisi's book on statistical field theory [7]: let us assume that the approximate and exact distributions are close in some sense, i.e. Q(S) - P(S|X, A, σ^2) = ε; then ⟨S_mt S_m't'⟩_exact = ⟨S_mt S_m't'⟩_approx + O(ε). Mean field theory gives a lower bound on the log-likelihood since KL, eq. 
(8), is non-negative. Consequently, the linear term vanishes in the expansion of the log-likelihood: log P(X|A, σ^2) = log P~(X|A, σ^2) + O(ε^2). It is therefore more precise to obtain moments of the variables through derivatives of the approximate log-likelihood, i.e. by linear response. \n\nA final remark to complete the picture: if diag(J) in eq. (10) is exchanged with λ_t = diag(λ_1t, ..., λ_Mt), and likewise in the definition of Λ_t above, we get the TAP equations [9]. The TAP equation for λ_mt is χ^t_{mm} = ∂f(γ_mt, λ_mt)/∂γ_mt = [(Λ_t - J)^{-1}]_{mm} . \n\nFigure 3: Binary source recovery for high noise level (M = 2, D = 2). Shown from left to right: +/- the column vectors of the true A (with the observations superimposed); the estimated A (NMF); the estimated A (LR). \n\nFigure 6: Overcomplete continuous source recovery with M = 3 and D = 2. Same plot as in figure 2. Note that the initial iteration step for A is very large. \n\n4.2 Continuous Source \n\nTo give a tractable example which illustrates the improvement by LR, we consider the Gaussian prior P(Smt) ∝ exp(-α Smt^2 / 2) (not suitable for source separation). This leads to f(γ_mt, λ_mt) = γ_mt / (α - λ_mt). Since we have a factorized distribution, ensemble learning predicts ⟨S_mt S_m't'⟩ - ⟨S_mt⟩⟨S_m't'⟩ = δ_mm' δ_tt' (α - λ_mt)^{-1} = δ_mm' δ_tt' (α - J_mm)^{-1}, where the second equality follows from eq. (11). Linear response, eq. 
(17), gives ⟨S_mt S_m't'⟩ - ⟨S_mt⟩⟨S_m't'⟩ = δ_tt' [(αI - J)^{-1}]_{mm'}, which is identical to the exact result obtained by direct integration. \n\nFor the popular choice of prior P(Smt) = 1/(π cosh(Smt)) [1], it is not possible to derive f(γ_mt, λ_mt) analytically. However, f(γ_mt, λ_mt) can be calculated analytically for the very similar Laplace distribution. Both these examples have positive kurtosis. Mean-field equations for negative kurtosis can be obtained using the prior P(Smt) ∝ exp(-(Smt - μ)^2/2) + exp(-(Smt + μ)^2/2) [1], for which f(γ_mt, λ_mt) is again available in closed form. \n\nFigures 5 and 6 show simulations using this source prior with μ = 1 in an overcomplete setting with D = 2 and M = 3. Note that μ = 1 yields a unimodal source distribution, hence qualitatively different from the bimodal prior considered in the binary case. In the overcomplete setting the NMF approach fails to recover the true sources. See [13] for further discussion of the overcomplete case. \n\n5 Conclusion \n\nWe have presented a general ICA mean-field framework based upon ensemble learning and linear response theory. The naive mean-field approach (pure ensemble learning) fails in some cases, and we speculate that it is incapable of handling the overcomplete case (more sources than sensors). Linear response theory, on the other hand, succeeds in all the examples studied. \n\nThere are two directions in which we plan to extend this work: (1) to sources with temporal correlations and (2) to source models defined not by a parametric source prior, but directly in terms of the function f, which defines the mean-field equations. Starting directly from the f-function makes it possible to test a whole range of implicitly defined source priors. A detailed analysis of a large selection of constrained and unconstrained source priors as well as comparisons of LR and the TAP approach can be found in [14]. 
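The tractable Gaussian-prior case of section 4.2 can be checked numerically: for that prior, Λ_t = αI, so the LR covariance (αI - J)^{-1} coincides with the exact posterior covariance, while the NMF estimate keeps only a diagonal. The following is our own small numpy sketch of this comparison (the function name and parameter values are illustrative choices, not from the paper):

```python
import numpy as np

def compare_covariances(A, alpha, sigma2):
    """Return (chi_nmf, chi_lr, chi_exact) for a Gaussian source prior
    P(S) ~ exp(-alpha S^2 / 2), mixing matrix A and noise variance sigma2."""
    M = A.shape[1]
    J = -A.T @ A / sigma2                           # interaction matrix, eq. (7)
    # Naive mean field: factorized Q(S) gives a purely diagonal covariance
    chi_nmf = np.diag(1.0 / (alpha - np.diag(J)))
    # Linear response, eq. (17), with Lambda = alpha * I for this prior
    chi_lr = np.linalg.inv(alpha * np.eye(M) - J)
    # Exact Gaussian posterior covariance: (alpha*I + A^T A / sigma2)^-1
    chi_exact = np.linalg.inv(alpha * np.eye(M) + A.T @ A / sigma2)
    return chi_nmf, chi_lr, chi_exact

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))                     # overcomplete: D = 2, M = 3
chi_nmf, chi_lr, chi_exact = compare_covariances(A, alpha=1.0, sigma2=0.1)
print(np.allclose(chi_lr, chi_exact))               # LR matches the exact covariance
print(np.allclose(chi_nmf, chi_exact))              # NMF misses off-diagonal terms
```

For a generic A the exact covariance has sizable off-diagonal entries, which is precisely the information the mixing-matrix update, eq. (3), needs and the factorized approximation discards.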
\n\nAcknowledgments \n\nPHS wishes to thank Mike Jordan for stimulating discussions on mean field and variational methods. This research is supported by the Swedish Foundation for Strategic Research as well as the Danish Research Councils through the Computational Neural Network Center (CONNECT) and the THOR Center for Neuroinformatics. \n\nReferences \n\n[1] T.-W. Lee: Independent Component Analysis, Kluwer Academic Publishers, Boston (1998). \n[2] A. Belouchrani and J.-F. Cardoso: Maximum Likelihood Source Separation by the Expectation-Maximization Technique: Deterministic and Stochastic Implementation, in Proc. NOLTA, 49-53 (1995). \n[3] D. MacKay: Maximum Likelihood and Covariant Algorithms for Independent Components Analysis, Draft 3.7 (1996). \n[4] H. Lappalainen and J. W. Miskin: Ensemble Learning, in Advances in Independent Component Analysis, Ed. M. Girolami, in press (2000). \n[5] C. Peterson and J. Anderson: A Mean Field Theory Learning Algorithm for Neural Networks, Complex Systems 1, 995-1019 (1987). \n[6] H. J. Kappen and F. B. Rodriguez: Efficient Learning in Boltzmann Machines Using Linear Response Theory, Neural Computation 10, 1137-1156 (1998). \n[7] G. Parisi: Statistical Field Theory, Addison-Wesley, Reading, Massachusetts (1988). \n[8] L. K. Saul, T. Jaakkola and M. I. Jordan: Mean Field Theory of Sigmoid Belief Networks, Journal of Artificial Intelligence Research 4, 61-76 (1996). \n[9] M. Opper and O. Winther: Tractable Approximations for Probabilistic Models: The Adaptive TAP Mean Field Approach, submitted to Phys. Rev. Lett. (2000). \n[10] L. K. Hansen: Blind Separation of Noisy Image Mixtures, in Advances in Independent Component Analysis, Ed. M. Girolami, in press (2000). \n[11] L. Csató, E. Fokoue, M. Opper, B. Schottky and O. 
Winther: Efficient Approaches to Gaussian Process Classification, in Advances in Neural Information Processing Systems 12 (NIPS'99), Eds. S. A. Solla, T. K. Leen and K.-R. Müller, MIT Press (2000). \n[12] A.-J. van der Veen: Analytical Method for Blind Binary Signal Separation, IEEE Trans. on Signal Processing 45(4), 1078-1082 (1997). \n[13] M. S. Lewicki and T. J. Sejnowski: Learning Overcomplete Representations, Neural Computation 12, 337-365 (2000). \n[14] P. A. d. F. R. H\u00f8jen-S\u00f8rensen, O. Winther and L. K. Hansen: Mean Field Approaches to Independent Component Analysis, in preparation. \n", "award": [], "sourceid": 1806, "authors": [{"given_name": "Pedro", "family_name": "H\u00f8jen-S\u00f8rensen", "institution": null}, {"given_name": "Ole", "family_name": "Winther", "institution": null}, {"given_name": "Lars", "family_name": "Hansen", "institution": null}]}