{"title": "A Pseudo-Euclidean Iteration for Optimal Recovery in Noisy ICA", "book": "Advances in Neural Information Processing Systems", "page_first": 2872, "page_last": 2880, "abstract": "Independent Component Analysis (ICA) is a popular model for blind signal separation. The ICA model assumes that a number of independent source signals are linearly mixed to form the observed signals. We propose a new algorithm, PEGI (for pseudo-Euclidean Gradient Iteration), for provable model recovery for ICA with Gaussian noise. The main technical innovation of the algorithm is to use a fixed point iteration in a pseudo-Euclidean (indefinite \u201cinner product\u201d) space. The use of this indefinite \u201cinner product\u201d resolves technical issues common to several existing algorithms for noisy ICA. This leads to an algorithm which is conceptually simple, efficient and accurate in testing. Our second contribution is combining PEGI with the analysis of objectives for optimal recovery in the noisy ICA model. It has been observed that the direct approach of demixing with the inverse of the mixing matrix is suboptimal for signal recovery in terms of the natural Signal to Interference plus Noise Ratio (SINR) criterion. There have been several partial solutions proposed in the ICA literature. It turns out that any solution to the mixing matrix reconstruction problem can be used to construct an SINR-optimal ICA demixing, despite the fact that SINR itself cannot be computed from data. 
That allows us to obtain a practical and provably SINR-optimal recovery method for ICA with arbitrary Gaussian noise.", "full_text": "A Pseudo-Euclidean Iteration for Optimal Recovery in Noisy ICA\n\nJames Voss\nThe Ohio State University\nvossj@cse.ohio-state.edu\n\nMikhail Belkin\nThe Ohio State University\nmbelkin@cse.ohio-state.edu\n\nLuis Rademacher\nThe Ohio State University\nlrademac@cse.ohio-state.edu\n\nAbstract\n\nIndependent Component Analysis (ICA) is a popular model for blind signal separation. The ICA model assumes that a number of independent source signals are linearly mixed to form the observed signals. We propose a new algorithm, PEGI (for pseudo-Euclidean Gradient Iteration), for provable model recovery for ICA with Gaussian noise. The main technical innovation of the algorithm is to use a fixed point iteration in a pseudo-Euclidean (indefinite \u201cinner product\u201d) space. The use of this indefinite \u201cinner product\u201d resolves technical issues common to several existing algorithms for noisy ICA. This leads to an algorithm which is conceptually simple, efficient and accurate in testing.\nOur second contribution is combining PEGI with the analysis of objectives for optimal recovery in the noisy ICA model. It has been observed that the direct approach of demixing with the inverse of the mixing matrix is suboptimal for signal recovery in terms of the natural Signal to Interference plus Noise Ratio (SINR) criterion. There have been several partial solutions proposed in the ICA literature. It turns out that any solution to the mixing matrix reconstruction problem can be used to construct an SINR-optimal ICA demixing, despite the fact that SINR itself cannot be computed from data. 
That allows us to obtain a practical and provably SINR-optimal recovery method for ICA with arbitrary Gaussian noise.\n\n1 Introduction\n\nIndependent Component Analysis refers to a class of methods aiming at recovering statistically independent signals by observing their unknown linear combination. There is an extensive literature on this and a number of related problems [7].\nIn the ICA model, we observe n-dimensional realizations x(1), . . . , x(N) of a latent variable model X = \u2211_{k=1}^{m} SkAk = AS, where Ak denotes the kth column of the n \u00d7 m mixing matrix A and S = (S1, . . . , Sm)^T is the unseen latent random vector of \u201csignals\u201d. It is assumed that S1, . . . , Sm are independent and non-Gaussian. The source signals and entries of A may be either real- or complex-valued. For simplicity, we will assume throughout that S has zero mean, as this may be achieved in practice by centering the observed data.\nMany ICA algorithms use the preprocessing \u201cwhitening\u201d step whose goal is to orthogonalize the independent components. In the noiseless case, this is commonly done by computing the square root of the covariance matrix of X. Consider now the noisy ICA model X = AS + \u03b7 with additive 0-mean noise \u03b7 independent of S. It turns out that the introduction of noise makes accurate recovery of the signals significantly more involved. Specifically, whitening using the covariance matrix does not work in the noisy ICA model as the covariance matrix combines both signal and noise. For the case when the noise is Gaussian, matrices constructed from higher order statistics (specifically, cumulants) can be used instead of the covariance matrix. However, these matrices are not in general positive definite and thus the square root cannot always be extracted. This limits the applicability of several previous methods, such as [1, 2, 9]. 
The GI-ICA algorithm proposed in [21] addresses this issue by using a complicated quasi-orthogonalization step followed by an iterative method.\nIn this paper (section 2), we develop a simple and practical one-step algorithm, PEGI (for pseudo-Euclidean Gradient Iteration), for provably recovering A (up to the unavoidable ambiguities of the model) in the case when the noise is Gaussian (with an arbitrary, unknown covariance matrix). The main technical innovation of our approach is to formulate the recovery problem as a fixed point method in an indefinite (pseudo-Euclidean) \u201cinner product\u201d space.\nThe second contribution of the paper is combining PEGI with the analysis of objectives for optimal recovery in the noisy ICA model. In most applications of ICA (e.g., speech separation [18], MEG/EEG artifact removal [20] and others) one cares about recovering the signals s(1), . . . , s(N). This is known as the source recovery problem. This is typically done by first recovering the matrix A (up to an appropriate scaling of the column directions).\nAt first, source recovery and recovering the mixing matrix A appear to be essentially equivalent. In the noiseless ICA model, if A is invertible1 then s(t) = A\u22121x(t) recovers the sources. On the other hand, in the noisy model, the exact recovery of the latent sources s(t) becomes impossible even if A is known exactly. Part of the \u201cnoise\u201d can be incorporated into the \u201csignal\u201d preserving the form of the model. Even worse, neither A nor S is defined uniquely as there is an inherent ambiguity in the setting. There could be many equivalent decompositions of the observed signal as X = A\u2032S\u2032 + \u03b7\u2032 (see the discussion in section 3).\nWe consider recovered signals of the form \u02c6S(B) := BX for a choice of m \u00d7 n demixing matrix B. Signal recovery is considered optimal if the coordinates of \u02c6S(B) = ( \u02c6S1(B), . . . 
, \u02c6Sm(B)) maximize Signal to Interference plus Noise Ratio (SINR) within any fixed model X = AS + \u03b7. Note that the value of SINR depends on the decomposition of the observed data into \u201cnoise\u201d and \u201csignal\u201d: X = A\u2032S\u2032 + \u03b7\u2032.\nSurprisingly, the SINR optimal demixing matrix does not depend on the decomposition of data into signal plus noise. As such, SINR optimal ICA recovery is well defined given access to data despite the inherent ambiguity in the model. Further, it will be seen that the SINR optimal demixing can be constructed from cov(X) and the directions of the columns of A (which are also well-defined across signal/noise decompositions).\nOur SINR-optimal demixing approach combined with the PEGI algorithm provides a complete SINR-optimal recovery algorithm in the ICA model with arbitrary Gaussian noise. We note that the ICA papers of which we are aware that discuss optimal demixing do not observe that SINR optimal demixing is invariant to the choice of signal/noise decomposition. Instead, they propose more limited strategies for improving the demixing quality within a fixed ICA model. For instance, Joho et al. [14] show how SINR-optimal demixing can be approximated with extra sensors when assuming a white additive noise, and Koldovsk\u00fd and Tichavsk\u00fd [16] discuss how to achieve asymptotically low bias ICA demixing assuming white noise within a fixed ICA model. However, the invariance of the SINR-optimal demixing matrix appears in the array sensor systems literature [6].\nFinally, in section 4, we demonstrate experimentally that our proposed algorithm for ICA outperforms existing practical algorithms at the task of noisy signal recovery, including those specifically designed for beamforming, when given sufficiently many samples. 
Moreover, most existing practical algorithms for noisy source recovery have a bias and cannot recover the optimal demixing matrix even with infinite samples. We also show that PEGI requires significantly fewer samples than GI-ICA [21] to perform ICA accurately.\n\n1.1 The Indeterminacies of ICA\nNotation: We use M\u2217 to denote the entry-wise complex conjugate of a matrix M, M^T to denote its transpose, M^H to denote its conjugate transpose, and M\u2020 to denote its Moore-Penrose pseudoinverse.\nBefore proceeding with our results, we discuss the somewhat subtle issue of indeterminacies in ICA. These ambiguities arise from the fact that the observed X may have multiple decompositions into ICA models X = AS + \u03b7 and X = A\u2032S\u2032 + \u03b7\u2032.\n\n1A\u22121 can be replaced with A\u2020 (A\u2019s pseudoinverse) in the discussion below for over-determined ICA.\n\nNoise-free ICA has two natural indeterminacies. For any nonzero constant \u03b1, the contribution of the kth component AkSk to the model can equivalently be obtained by replacing Ak with \u03b1Ak and Sk with the rescaled signal (1/\u03b1)Sk. To lessen this scaling indeterminacy, we use the convention2 that cov(S) = I throughout this paper. As such, each source Sk (or equivalently each Ak) is defined up to a choice of sign (a unit modulus factor in the complex case). In addition, there is an ambiguity in the order of the latent signals. For any permutation \u03c0 of [m] (where [m] := {1, . . . , m}), the ICA models X = \u2211_{k=1}^{m} SkAk and X = \u2211_{k=1}^{m} S\u03c0(k)A\u03c0(k) are indistinguishable. In the noise free setting, A is said to be recovered if we recover each column of A up to a choice of sign (or up to a unit modulus factor in the complex case) and an unknown permutation. As the sources S1, . . . 
, Sm are only defined up to the same indeterminacies, inverting the recovered matrix \u02dcA to obtain a demixing matrix works for signal recovery.\nIn the noisy ICA setting, there is an additional indeterminacy in the definition of the sources. Consider a 0-mean axis-aligned Gaussian random vector \u03be. Then, the noisy ICA model X = A(S + \u03be) + \u03b7 in which \u03be is considered part of the latent source signal S\u2032 = S + \u03be, and the model X = AS + (A\u03be + \u03b7) in which \u03be is part of the noise are indistinguishable. In particular, the latent source S and its covariance are ill-defined. Due to this extra indeterminacy, the lengths of the columns of A no longer have a fully defined meaning even when we assume cov(S) = I. In the noisy setting, A is said to be recovered if we obtain the columns of A up to non-zero scalar multiplicative factors and an arbitrary permutation.\nThe last indeterminacy is the most troubling as it suggests that the power of each source signal is itself ill-defined in the noisy setting. Despite this indeterminacy, it is possible to perform an SINR-optimal demixing without additional assumptions about what portion of the signal is source and what portion is noise. In section 3, we will see that SINR-optimal source recovery takes on a simple form: Given any solution \u02dcA which recovers A up to the inherent ambiguities of noisy ICA, then \u02dcA^H cov(X)\u2020 is an SINR-optimal demixing matrix.\n\n1.2 Related Work and Contributions\nIndependent Component Analysis is probably the most used model for Blind Signal Separation. It has seen numerous applications and has generated a vast literature, including in the noisy and underdetermined settings. We refer the reader to the books [7, 13] for a broad overview of the subject.\nIt was observed early on by Cardoso [4] that ICA algorithms based solely on higher order cumulant statistics are invariant to additive Gaussian noise. 
This observation has allowed the creation of many algorithms for recovering the ICA mixing matrix in the noisy and often underdetermined settings. Despite the significant work on noisy ICA algorithms, they remain less efficient, more specialized, or less practical than the most popular noise free ICA algorithms.\nResearch on cumulant-based noisy ICA can largely be split into several lines of work which we only highlight here. Some algorithms such as FOOBI [4] and BIOME [1] directly use the tensor structure of higher order cumulants. In another line of work, De Lathauwer et al. [8] and Yeredor [23] have suggested algorithms which jointly diagonalize cumulant matrices in a manner reminiscent of the noise-free JADE algorithm [3]. In addition, Yeredor [22] and Goyal et al. [11] have proposed ICA algorithms based on random directional derivatives of the second characteristic function.\nEach line of work has its advantages and disadvantages. The joint diagonalization algorithms and the tensor based algorithms tend to be practical in the sense that they use redundant cumulant information in order to achieve more accurate results. However, they have a higher memory complexity than popular noise free ICA algorithms such as FastICA [12]. While the tensor methods (FOOBI and BIOME) can be used when there are more sources than the dimensionality of the space (the underdetermined ICA setting), they require all the latent source signals to have positive order 2k cumulants (k \u2265 2, a predetermined fixed integer) as they rely on taking a matrix square root. Finally, the methods based on random directional derivatives of the second characteristic function rely heavily upon randomness in a manner not required by the most popular noise free ICA algorithms.\nWe continue a line of research started by Arora et al. [2] and Voss et al. 
[21] on fully determined noisy ICA which addresses some of these practical issues by using a deflationary approach reminiscent of FastICA. Their algorithms thus have lower memory complexity and are more scalable to high dimensional data than the joint diagonalization and tensor methods. However, both works require a preprocessing step (quasi-orthogonalization) to orthogonalize the latent signals which is based on taking a matrix square root. Arora et al. [2] require each latent signal to have positive fourth cumulant in order to carry out this preprocessing step. In contrast, Voss et al. [21] are able to perform quasi-orthogonalization with source signals of mixed sign fourth cumulants; but their quasi-orthogonalization step is more complicated and can run into numerical issues under sampling error.\n\n2Alternatively, one may place the scaling information in the signals by setting \u2016Ak\u2016 = 1 for each k.\n\nWe demonstrate that quasi-orthogonalization is unnecessary. We introduce the PEGI algorithm to work within a (not necessarily positive definite) inner product space instead. Experimentally, this leads to improved demixing performance. In addition, we handle the case of complex signals.\nFinally, another line of work attempts to perform SINR-optimal source recovery in the noisy ICA setting. It was noted by Koldovsk\u00fd and Tichavsk\u00fd [15] that for noisy ICA, traditional ICA algorithms such as FastICA and JADE actually outperform algorithms which first recover A in the noisy setting and then use the resulting approximation of A\u2020 to perform demixing. It was further observed that A\u2020 is not the optimal demixing matrix for source recovery. 
Later, Koldovsk\u00fd and Tichavsk\u00fd [17] proposed an algorithm based on FastICA which performs a low SINR-bias beamforming.\n\n2 Pseudo-Euclidean Gradient Iteration ICA\nIn this section, we introduce the PEGI algorithm for recovering A in the \u201cfully determined\u201d noisy ICA setting where m \u2264 n. PEGI relies on the idea of Gradient Iteration introduced by Voss et al. [21]. However, unlike GI-ICA [21], PEGI does not require the source signals to be orthogonalized. As such, PEGI does not require the complicated quasi-orthogonalization preprocessing step of GI-ICA which can be inaccurate to compute in practice. We sketch the Gradient Iteration algorithm in Section 2.1, and then introduce PEGI in Section 2.2. For simplicity, we limit this discussion to the case of real-valued signals. A mild variation of our PEGI algorithm works for complex-valued signals, and its construction is provided in the supplementary material.\nIn this section we assume a noisy ICA model X = AS + \u03b7 such that \u03b7 is arbitrary Gaussian and independent of S. We also assume that m \u2264 n, that m is known, and that the columns of A are linearly independent.\n\n2.1 Gradient Iteration with Orthogonality\nThe gradient iteration relies on the properties of cumulants. We will focus on the fourth cumulant, though similar constructions may be given using other even order cumulants of higher order. For a zero-mean random variable X, the fourth order cumulant may be defined as \u03ba4(X) := E[X^4] \u2212 3E[X^2]^2 [see 7, Chapter 5, Section 1.2]. Higher order cumulants have nice algebraic properties which make them useful for ICA. In particular, \u03ba4 has the following properties: (1) (Independence) If X and Y are independent, then \u03ba4(X + Y ) = \u03ba4(X) + \u03ba4(Y ). (2) (Homogeneity) If \u03b1 is a scalar, then \u03ba4(\u03b1X) = \u03b1^4\u03ba4(X). 
(3) (Vanishing Gaussians) If X is normally distributed then \u03ba4(X) = 0.\nWe consider the following function defined on the unit sphere: f(u) := \u03ba4(\u27e8X, u\u27e9). Expanding f(u) using the above properties we obtain:\n\nf(u) = \u03ba4(\u2211_{k=1}^{m} \u27e8Ak, u\u27e9Sk + \u27e8u, \u03b7\u27e9) = \u2211_{k=1}^{m} \u27e8Ak, u\u27e9^4 \u03ba4(Sk) .   (1)\n\nTaking derivatives we obtain:\n\n\u2207f(u) = 4 \u2211_{k=1}^{m} \u27e8Ak, u\u27e9^3 \u03ba4(Sk)Ak\n\nHf(u) = 12 \u2211_{k=1}^{m} \u27e8Ak, u\u27e9^2 \u03ba4(Sk)AkAk^T = AD(u)A^T   (2)\n\nwhere D(u) is a diagonal matrix with entries D(u)kk = 12\u27e8Ak, u\u27e9^2 \u03ba4(Sk). We also note that f(u), \u2207f(u), and Hf(u) have natural sample estimates (see [21]).\nVoss et al. [21] introduced GI-ICA as a fixed point algorithm under the assumption that the columns of A are orthogonal but not necessarily unit vectors. The main idea is that the update u \u2190 \u2207f(u)/\u2016\u2207f(u)\u2016 is a form of a generalized power iteration. From equation (1), each Ak may be considered as a direction in a hidden orthogonal basis of the space. During each iteration, the Ak coordinate of u is raised to the 3rd power and multiplied by a constant. Treating this iteration as a fixed point update, it was shown that given a random starting point, this iterative procedure converges rapidly to one of the columns of A (up to a choice of sign). The rate of convergence is cubic.\nHowever, the GI-ICA algorithm requires a somewhat complicated preprocessing step called quasi-orthogonalization to linearly transform the data to make columns of A orthogonal. 
Quasi-orthogonalization makes use of evaluations of Hessians of the fourth cumulant function to construct a matrix of the form C = ADA^T where D has all positive diagonal entries\u2014a task which is complicated by the possibility that the latent signals Si may have fourth order cumulants of differing signs\u2014and requires taking the matrix square root of a positive definite matrix of this form. However, the matrix C constructed under sampling error is not always positive definite in practice, which can make the preprocessing step fail. We will show how our PEGI algorithm makes quasi-orthogonalization unnecessary, in particular, resolving this issue.\n\n2.2 Gradient Iteration in a Pseudo-Euclidean Space\nWe now show that the gradient iteration can be performed in a pseudo-Euclidean space in which the columns of A are orthogonal. The natural candidate for the \u201cinner product space\u201d would be to use \u27e8\u00b7,\u00b7\u27e9\u2217 defined as \u27e8u, v\u27e9\u2217 := u^T(AA^T)\u2020v. Clearly, \u27e8Ai, Aj\u27e9\u2217 = \u03b4ij gives the desired orthogonality property. However, there are two issues with this \u201cinner product space\u201d: First, it is only an inner product space when A is invertible. This turns out not to be a major issue, and we move forward largely ignoring this point. The second issue is more fundamental: We only have access to AA^T in the noise free setting where cov(X) = AA^T. In the noisy setting, we have access to matrices of the form Hf(u) = AD(u)A^T from equation (2) instead.\nWe consider a pseudo-Euclidean inner product defined as follows: Let C = ADA^T where D is a diagonal matrix with non-zero diagonal entries, and define \u27e8\u00b7,\u00b7\u27e9C by \u27e8u, v\u27e9C = u^T C\u2020 v. When D contains negative entries, this is not a proper inner product since C is not positive definite. In particular, \u27e8Ak, Ak\u27e9C = Ak^T(ADA^T)\u2020Ak = dkk^{\u22121} may be negative. Nevertheless, when k \u2260 j, \u27e8Ak, Aj\u27e9C = Ak^T(ADA^T)\u2020Aj = 0 gives that the columns of A are orthogonal in this space.\nWe define functions \u03b1k : R^n \u2192 R by \u03b1k(u) = (A\u2020u)k such that for any u \u2208 span(A1, . . . , Am), u = \u2211_{i=1}^{m} \u03b1i(u)Ai is the expansion of u in its Ai basis. Continuing from equation (1), for any u \u2208 S^{n\u22121} we see that \u2207f(C\u2020u) = 4 \u2211_{k=1}^{m} \u27e8Ak, C\u2020u\u27e9^3 \u03ba4(Sk)Ak = 4 \u2211_{k=1}^{m} \u27e8Ak, u\u27e9C^3 \u03ba4(Sk)Ak is the gradient iteration recast in the \u27e8\u00b7,\u00b7\u27e9C space. Expanding u in its Ak basis, we obtain\n\n\u2207f(C\u2020u) = 4 \u2211_{k=1}^{m} (\u03b1k(u)\u27e8Ak, Ak\u27e9C)^3 \u03ba4(Sk)Ak = 4 \u2211_{k=1}^{m} \u03b1k(u)^3 (dkk^{\u22123} \u03ba4(Sk))Ak ,   (3)\n\nwhich is a power iteration in the unseen Ak coordinate system. As no assumptions are made upon the \u03ba4(Sk) values, the dkk^{\u22123} scalings which were not present in eq. (1) cause no issues. Using this update, we obtain Alg. 1, a fixed point method for recovering a single column of A up to an unknown scaling.\n\nAlgorithm 1 Recovers a column of A up to a scaling factor if u0 is generically chosen.\nInputs: Unit vector u0, C, \u2207f\nk \u2190 1\nrepeat\n  uk \u2190 \u2207f(C\u2020uk\u22121)/\u2016\u2207f(C\u2020uk\u22121)\u2016\n  k \u2190 k + 1\nuntil Convergence (up to sign)\nreturn uk\n\nBefore proceeding, we should clarify the notion of fixed point convergence in Algorithm 1. We say that the sequence {uk}\u221ek=0 converges to v up to sign if there exists a sequence {ck}\u221ek=0 such that each ck \u2208 {\u00b11} and ckuk \u2192 v as k \u2192 \u221e. We have the following convergence guarantee.\nTheorem 1. 
If u0 is chosen uniformly at random from S^{n\u22121}, then with probability 1, there exists \u2113 \u2208 [m] such that the sequence {uk}\u221ek=0 defined as in Algorithm 1 converges to A\u2113/\u2016A\u2113\u2016 up to sign. Further, the rate of convergence is cubic.\nDue to limited space, we omit the proof of Theorem 1. It is similar to the proof of [21, Theorem 4].\nIn practice, we test near convergence by checking if we are still making significant progress. In particular, for some predefined \u03b5 > 0, if there exists a sign value ck \u2208 {\u00b11} such that \u2016uk \u2212 ckuk\u22121\u2016 < \u03b5, then we declare convergence achieved and return the result. As there are only two choices for ck, this is easily checked, and we exit the loop if this condition is met.\nFull ICA Recovery Via the Pseudo-Euclidean GI-Update. We are able to recover a single column of A up to its unknown scale. However, for full recovery of A, we would like (given recovered columns A\u21131, . . . , A\u2113j) to be able to recover a column Ak such that k \u2209 {\u21131, . . . , \u2113j} on demand.\nThe idea behind the simultaneous recovery of all columns of A is two-fold. First, instead of just finding columns of A using Algorithm 1, we simultaneously find rows of A\u2020. Then, using the recovered columns of A and rows of A\u2020, we project u onto the orthogonal complement of the recovered columns of A within the \u27e8\u00b7,\u00b7\u27e9C pseudo-inner product space.\nRecovering rows of A\u2020. Suppose we have access to a column Ak (which may be achieved using Algorithm 1). Let A\u2020k\u00b7 denote the kth row of A\u2020. Then, we note that C\u2020Ak = (ADA^T)\u2020Ak = dkk^{\u22121}(A^T)\u2020k = dkk^{\u22121}(A\u2020k\u00b7)^T recovers A\u2020k\u00b7 up to an arbitrary, unknown constant dkk^{\u22121}. However, the constant dkk^{\u22121} may be recovered by noting that \u27e8Ak, Ak\u27e9C = (C\u2020Ak)^T Ak = dkk^{\u22121}. As such, we may estimate A\u2020k\u00b7 as [C\u2020Ak/((C\u2020Ak)^T Ak)]^T.\nEnforcing Orthogonality During the GI Update. Given access to a vector u = \u2211_{k=1}^{m} \u03b1k(u)Ak + PA\u22a5u (where PA\u22a5 is the projection onto the orthogonal complement of the range of A), some recovered columns A\u21131, . . . , A\u2113r, and corresponding rows of A\u2020, we may zero out the components of u corresponding to the recovered columns of A. Letting u\u2032 = u \u2212 \u2211_{j=1}^{r} A\u2113j A\u2020\u2113j\u00b7u, then u\u2032 = \u2211_{k\u2208[m]\\{\u21131,...,\u2113r}} \u03b1k(u)Ak + PA\u22a5u. In particular, u\u2032 is orthogonal (in the \u27e8\u00b7,\u00b7\u27e9C space) to the previously recovered columns of A. This allows the non-orthogonal gradient iteration algorithm to recover a new column of A.\nUsing these ideas, we obtain Algorithm 2, which is the PEGI algorithm for recovery of the mixing matrix A in noisy ICA up to the inherent ambiguities of the problem. Within this Algorithm, step 6 enforces orthogonality with previously found columns of A, guaranteeing convergence to a new column of A.\n\nAlgorithm 2 Full ICA matrix recovery algorithm. Returns two matrices: (1) \u02dcA is the recovered mixing matrix for the noisy ICA model X = AS + \u03b7, and (2) \u02dcB is a running estimate of \u02dcA\u2020.\n1: Inputs: C, \u2207f\n2: \u02dcA \u2190 0, \u02dcB \u2190 0\n3: for j \u2190 1 to m do\n4:   Draw u uniformly at random from S^{n\u22121}.\n5:   repeat\n6:     u \u2190 u \u2212 \u02dcA \u02dcBu\n7:     u \u2190 \u2207f(C\u2020u)/\u2016\u2207f(C\u2020u)\u2016.\n8:   until Convergence (up to sign)\n9:   \u02dcAj \u2190 u\n10:   \u02dcBj\u00b7 \u2190 [C\u2020Aj/((C\u2020Aj)^T Aj)]^T\n11: end for\n12: return \u02dcA, \u02dcB\n\nPractical Construction of C. In our implementation, we set C = (1/12) \u2211_{k=1}^{n} Hf(ek), as it can be shown from equation (2) that (1/12) \u2211_{k=1}^{n} Hf(ek) = ADA^T with dkk = \u2016Ak\u2016^2 \u03ba4(Sk). This deterministically guarantees that each latent signal has a significant contribution to C.\n\n3 SINR Optimal Recovery in Noisy ICA\nIn this section, we demonstrate how to perform SINR optimal ICA within the noisy ICA framework given access to an algorithm (such as PEGI) to recover the directions of the columns of A. To this end, we first discuss the SINR optimal demixing solution within any decomposition of the ICA model into signal and noise as X = AS + \u03b7. We then demonstrate that the SINR optimal demixing matrix is actually the same across all possible model decompositions, and that it can be recovered. The results in this section hold in greater generality than in section 2. They hold even if m \u2265 n (the underdetermined setting) and even if the additive noise \u03b7 is non-Gaussian.\nConsider an m \u00d7 n demixing matrix B, and define \u02c6S(B) := BX, the resulting approximation to S. It will also be convenient to estimate the source signal S one coordinate at a time: Given a row vector b, we define \u02c6S(b) := bX. If b = Bk\u00b7 (the kth row of B), then \u02c6S(b) = [\u02c6S(B)]k = \u02c6Sk(B) is our estimate to the kth latent signal Sk. 
Within a specific ICA model X = AS + \u03b7, signal to interference-plus-noise ratio (SINR) is defined by the following equation:\n\nSINRk(b) := var(bAkSk) / (var(bAS \u2212 bAkSk) + var(b\u03b7)) = var(bAkSk) / (var(bX) \u2212 var(bAkSk)) .   (4)\n\nSINRk is the variance of the contribution of the kth source divided by the variance of the noise and interference contributions within the signal.\nGiven access to the mixing matrix A, we define Bopt = A^H(AA^H + cov(\u03b7))\u2020. Since cov(X) = AA^H + cov(\u03b7), it follows that Bopt = A^H cov(X)\u2020. Here, cov(X)\u2020 may be estimated from data, but due to the ambiguities of the noisy ICA model, A (and specifically its column norms) cannot be.\nKoldovsk\u00fd and Tichavsk\u00fd [15] observed that when \u03b7 is a white Gaussian noise, Bopt jointly maximizes SINRk for each k \u2208 [m], i.e., SINRk takes on its maximal value at (Bopt)k\u00b7. We generalize this result in Proposition 2 below to include arbitrary non-spherical, potentially non-Gaussian noise.\n\nFigure 1: SINR performance comparison of ICA algorithms. (a) Accuracy under additive Gaussian noise. (b) Bias under additive Gaussian noise.\n\nIt is interesting to note that even after the data is whitened, i.e. cov(X) = I, the optimal SINR solution is different from the optimal solution in the noiseless case unless A is an orthogonal matrix, i.e. A\u2020 = A^H. This is generally not the case, even if \u03b7 is white Gaussian noise.\nProposition 2. For each k \u2208 [m], (Bopt)k\u00b7 is a maximizer of SINRk.\nThe proof of Proposition 2 can be found in the supplementary material.\nSince SINR is scale invariant, Proposition 2 implies that any matrix of the form DBopt = DA^H cov(X)\u2020 where D is a diagonal scaling matrix (with non-zero diagonal entries) is an SINR-optimal demixing matrix. More formally, we have the following result.\nTheorem 3. 
Let \u02dcA be an n \u00d7 m matrix containing the columns of A up to scale and an arbitrary permutation. Then, ( \u02dcA^H cov(X)\u2020)\u03c0(k)\u00b7 is a maximizer of SINRk.\nBy Theorem 3, given access to a matrix \u02dcA which recovers the directions of the columns of A, then \u02dcA^H cov(X)\u2020 is the SINR-optimal demixing matrix. For ICA in the presence of Gaussian noise, the directions of the columns of A are well defined simply from X, that is, the directions of the columns of A do not depend on the decomposition of X into signal and noise (see the discussion in section 1.1 on ICA indeterminacies). The problem of SINR optimal demixing is thus well defined for ICA in the presence of Gaussian noise, and the SINR optimal demixing matrix can be estimated from data without any additional assumptions on the magnitude of the noise in the data.\nFinally, we note that in the noise-free case, the SINR-optimal source recovery simplifies to \u02dcA\u2020.\nCorollary 4. Suppose that X = AS is a noise free (possibly underdetermined) ICA model. Suppose that \u02dcA \u2208 R^{n\u00d7m} contains the columns of A up to scale and permutation, i.e., there exists a diagonal matrix D with non-zero entries and a permutation matrix \u03a0 such that \u02dcA = AD\u03a0. Then \u02dcA\u2020 is an SINR-optimal demixing matrix.\nCorollary 4 is consistent with known beamforming results. In particular, it is known that A\u2020 is optimal (in terms of minimum mean squared error) for underdetermined ICA [19, section 3B].\n\n4 Experimental Results\nWe compare the proposed PEGI algorithm with existing ICA algorithms. In addition to qorth+GI-ICA (i.e., GI-ICA with quasi-orthogonalization for preprocessing), we use the following baselines:\nJADE [3] is a popular fourth cumulant based ICA algorithm designed for the noise free setting. 
We use the implementation of Cardoso and Souloumiac [5].

FastICA [12] is a popular ICA algorithm designed for the noise-free setting, based on a deflationary approach of recovering one component at a time. We use the implementation of Gävert et al. [10].

1FICA [16, 17] is a variation of FastICA with the tanh contrast function, designed to have low bias for performing SINR-optimal beamforming in the presence of Gaussian noise.

Ainv performs oracle demixing using A† as the demixing matrix.

SINR-opt performs oracle demixing using AH cov(X)† to achieve SINR-optimal demixing.

We compare these algorithms on simulated data with n = m. We constructed mixing matrices A with condition number 3 via a reverse singular value decomposition (A = UΛV T). The matrices U and V were random orthogonal matrices, and Λ was chosen to have 1 and 3 as its minimum and maximum singular values respectively, with the intermediate singular values chosen uniformly at random. We drew data from a noisy ICA model X = AS + η where cov(η) = Σ was chosen to be misaligned with cov(AS) = AAT. We set Σ = p(10I − AAT), where p is a constant defining the noise power. It can be shown that p = maxv var(vT η) / maxv var(vT AS), the ratio of the maximum directional noise variance to the maximum directional signal variance. We generated 100 matrices A for our experiments, with 100 corresponding ICA data sets for each sample size and noise power. When reporting results, we apply each algorithm to each of the 100 data sets for the corresponding sample size and noise power, and we report the mean performance.
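The construction of A and Σ described above can be sketched as follows. This is a minimal numpy reconstruction for illustration, not the authors' code; the dimension n and the noise power p are example values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 14   # observed/source dimension (n = m in the experiments)
p = 0.1  # hypothetical noise-power constant

# Reverse SVD: A = U diag(s) V^T with random orthogonal U, V and
# singular values in [1, 3], so that cond(A) = 3.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = rng.uniform(1.0, 3.0, size=n)
s[0], s[-1] = 3.0, 1.0  # pin the maximum and minimum singular values
A = U @ np.diag(s) @ V.T

# Noise covariance misaligned with the signal covariance AA^T.
# Positive definite since the eigenvalues of AA^T are at most 9.
Sigma = p * (10.0 * np.eye(n) - A @ A.T)

# Check: p equals the ratio of the maximum directional noise variance
# to the maximum directional signal variance.
ratio = np.linalg.eigvalsh(Sigma).max() / np.linalg.eigvalsh(A @ A.T).max()
assert np.isclose(ratio, p)
```

The check works out because the eigenvalues of AA^T lie in [1, 9], so the largest eigenvalue of Σ is p(10 − 1) = 9p while the largest eigenvalue of AA^T is 9.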
The source distributions used in our ICA experiments were the Laplace distribution, the Bernoulli distributions with parameters 0.05 and 0.5, the t-distributions with 3 and 5 degrees of freedom, the exponential distribution, and the uniform distribution. Each distribution was normalized to have unit variance, and the distributions were each used twice to create 14-dimensional data. We compare the algorithms using either SINR or the SINR loss from the optimal demixing matrix (defined as SINR Loss = Optimal SINR − Achieved SINR).

In Figure 1b, we compare our proposed ICA algorithm with various ICA algorithms for signal recovery. In the PEGI-κ4+SINR algorithm, we use PEGI-κ4 to estimate A, and then perform demixing using the resulting estimate of AH cov(X)−1, the formula for SINR-optimal demixing. It is apparent that when given sufficient samples, PEGI-κ4+SINR provides the best SINR demixing. JADE, FastICA-tanh, and 1FICA each have a bias in the presence of additive Gaussian noise which keeps them from being SINR-optimal even when given many samples.

Figure 2: Accuracy comparison of PEGI using pseudo-inner product spaces and GI-ICA using quasi-orthogonalization.

In Figure 1a, we compare the algorithms at various sample sizes. The PEGI-κ4+SINR algorithm relies more heavily on accurate estimates of fourth-order statistics than JADE, and the FastICA-tanh and 1FICA algorithms do not require the estimation of fourth-order statistics at all. For this reason, PEGI-κ4+SINR requires more samples than the other algorithms in order to run accurately. However, once sufficient samples are taken, PEGI-κ4+SINR outperforms the other algorithms, including 1FICA, which is designed to have low SINR bias.
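The SINR-optimal demixing formula can also be sanity-checked at the population level, i.e., using exact covariances rather than samples. The numpy sketch below is illustrative and not from the paper; it assumes real-valued, unit-variance sources (so AH = AT), and the dimension and noise covariance are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))                  # mixing matrix
Sigma = 0.5 * np.eye(n) + 0.1 * np.ones((n, n))  # arbitrary non-spherical noise covariance
cov_X = A @ A.T + Sigma                          # cov(X) for unit-variance sources

def sinr(b, k):
    # Population SINR_k(b) = var(b A_k S_k) / (var(b X) - var(b A_k S_k)), eq. (4).
    sig = (b @ A[:, k]) ** 2
    return sig / (b @ cov_X @ b - sig)

B_opt = A.T @ np.linalg.inv(cov_X)  # B_opt = A^H cov(X)^{-1}
B_inv = np.linalg.inv(A)            # naive demixing with A^{-1}

# Proposition 2: row k of B_opt maximizes SINR_k, so it is at least
# as good as the corresponding row of A^{-1}.
for k in range(n):
    assert sinr(B_opt[k], k) >= sinr(B_inv[k], k) - 1e-12

# Scale invariance behind Theorem 3: rescaling a row leaves SINR unchanged.
assert np.isclose(sinr(2.7 * B_opt[0], 0), sinr(B_opt[0], 0))
```

In the noisy setting the first inequality is typically strict, which matches the gap between the Ainv and SINR-opt oracle baselines in the experiments.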
We also note that, while not reported in order to avoid clutter, the kurtosis-based FastICA performed very similarly to FastICA-tanh in our experiments. For the same reason, we did not include qorth+GI-ICA-κ4+SINR (the SINR-optimal demixing estimate constructed using qorth+GI-ICA-κ4 to estimate A) in Figures 1a and 1b. It is also asymptotically unbiased in estimating the directions of the columns of A, and similar conclusions could be drawn using qorth+GI-ICA-κ4 in place of PEGI-κ4. However, in Figure 2, we see that PEGI-κ4+SINR requires fewer samples than qorth+GI-ICA-κ4+SINR to achieve good performance. This is particularly highlighted in the medium sample regime.

On the Performance of Traditional ICA Algorithms for Noisy ICA. An interesting observation (first made in [15]) is that the popular noise-free ICA algorithms JADE and FastICA perform reasonably well in the noisy setting. In Figures 1a and 1b, they significantly outperform demixing using A−1 for source recovery. It turns out that this may be explained by a shared preprocessing step. Both JADE and FastICA rely on a whitening preprocessing step in which the data are linearly transformed to have identity covariance. It can be shown in the noise-free setting that, after whitening, the mixing matrix A is a rotation matrix. These algorithms proceed by recovering an orthogonal matrix Ã to approximate the true mixing matrix A. Demixing is performed using Ã−1 = ÃH. Since the data is white (has identity covariance), the demixing matrix ÃH = ÃH cov(X)−1 is an estimate of the SINR-optimal demixing matrix. Nevertheless, the traditional ICA algorithms give a biased estimate of A under additive Gaussian noise.

References

[1] L. Albera, A. Ferréol, P. Comon, and P. Chevalier. Blind identification of overcomplete mixtures of sources (BIOME).
Linear Algebra and its Applications, 391:3–30, 2004.

[2] S. Arora, R. Ge, A. Moitra, and S. Sachdeva. Provable ICA with unknown Gaussian noise, with implications for Gaussian mixtures and autoencoders. In NIPS, pages 2384–2392, 2012.

[3] J. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. In Radar and Signal Processing, IEE Proceedings F, volume 140(6), pages 362–370. IET, 1993.

[4] J.-F. Cardoso. Super-symmetric decomposition of the fourth-order cumulant tensor. Blind identification of more sources than sensors. In ICASSP, pages 3109–3112. IEEE, 1991.

[5] J.-F. Cardoso and A. Souloumiac. Matlab JADE for real-valued data v 1.8. http://perso.telecom-paristech.fr/~cardoso/Algo/Jade/jadeR.m, 2005. [Online; accessed 8-May-2013].

[6] P. Chevalier. Optimal separation of independent narrow-band sources: Concept and performance 1. Signal Processing, 73(1-2):27–47, 1999. ISSN 0165-1684.

[7] P. Comon and C. Jutten, editors. Handbook of Blind Source Separation. Academic Press, 2010.

[8] L. De Lathauwer, B. De Moor, and J. Vandewalle. Independent component analysis based on higher-order statistics only. In Statistical Signal and Array Processing, 1996. Proceedings., 8th IEEE Signal Processing Workshop on, pages 356–359. IEEE, 1996.

[9] L. De Lathauwer, J. Castaing, and J. Cardoso. Fourth-order cumulant-based blind identification of underdetermined mixtures. Signal Processing, IEEE Transactions on, 55(6):2965–2973, June 2007. ISSN 1053-587X. doi: 10.1109/TSP.2007.893943.

[10] H. Gävert, J. Hurri, J. Särelä, and A. Hyvärinen. Matlab FastICA v 2.5. http://research.ics.aalto.fi/ica/fastica/code/dlcode.shtml, 2005. [Online; accessed 1-May-2013].

[11] N. Goyal, S. Vempala, and Y. Xiao. Fourier PCA and robust tensor decomposition. In STOC, pages 584–593, 2014.

[12] A. Hyvärinen and E. Oja.
Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.

[13] A. Hyvärinen, J. Karhunen, and E. Oja. Independent component analysis. John Wiley & Sons, 2001.

[14] M. Joho, H. Mathis, and R. H. Lambert. Overdetermined blind source separation: Using more sensors than source signals in a noisy mixture. In Proc. International Conference on Independent Component Analysis and Blind Signal Separation, Helsinki, Finland, pages 81–86, 2000.

[15] Z. Koldovský and P. Tichavský. Methods of fair comparison of performance of linear ICA techniques in presence of additive noise. In ICASSP, pages 873–876, 2006.

[16] Z. Koldovský and P. Tichavský. Asymptotic analysis of bias of FastICA-based algorithms in presence of additive noise. Technical report, 2007.

[17] Z. Koldovský and P. Tichavský. Blind instantaneous noisy mixture separation with best interference-plus-noise rejection. In Independent Component Analysis and Signal Separation, pages 730–737. Springer, 2007.

[18] S. Makino, T.-W. Lee, and H. Sawada. Blind speech separation. Springer, 2007.

[19] B. D. Van Veen and K. M. Buckley. Beamforming: A versatile approach to spatial filtering. IEEE ASSP Magazine, 5(2):4–24, 1988.

[20] R. Vigário, J. Särelä, V. Jousmäki, M. Hämäläinen, and E. Oja. Independent component approach to the analysis of EEG and MEG recordings. Biomedical Engineering, IEEE Transactions on, 47(5):589–593, 2000.

[21] J. R. Voss, L. Rademacher, and M. Belkin. Fast algorithms for Gaussian noise invariant independent component analysis. In Advances in Neural Information Processing Systems 26, pages 2544–2552, 2013.

[22] A. Yeredor. Blind source separation via the second characteristic function. Signal Processing, 80(5):897–902, 2000.

[23] A. Yeredor.
Non-orthogonal joint diagonalization in the least-squares sense with application in blind source separation. Signal Processing, IEEE Transactions on, 50(7):1545–1553, 2002.