{"title": "A Probabilistic Approach to Single Channel Blind Signal Separation", "book": "Advances in Neural Information Processing Systems", "page_first": 1197, "page_last": 1204, "abstract": null, "full_text": "A Probabilistic Approach to Single Channel\n\nBlind Signal Separation\n\nGil-Jin Jang\n\nSpoken Language Laboratory\n\nKAIST, Daejon 305-701, South Korea\n\njangbal@bawi.org\n\nhttp://speech.kaist.ac.kr/\u02dcjangbal\n\nTe-Won Lee\n\nInstitute for Neural Computation\nUniversity of California, San Diego\n\nLa Jolla, CA 92093, U.S.A.\n\ntewon@inc.ucsd.edu\n\nAbstract\n\nWe present a new technique for achieving source separation when given\nonly a single channel recording. The main idea is based on exploiting the\ninherent time structure of sound sources by learning a priori sets of basis\n\ufb01lters in time domain that encode the sources in a statistically ef\ufb01cient\nmanner. We derive a learning algorithm using a maximum likelihood\napproach given the observed single channel data and sets of basis \ufb01lters.\nFor each time point we infer the source signals and their contribution\nfactors. This inference is possible due to the prior knowledge of the\nbasis \ufb01lters and the associated coef\ufb01cient densities. 
A flexible model for density estimation allows accurate modeling of the observation, and our experimental results exhibit a high level of separation performance for mixtures of two music signals as well as the separation of two voice signals.\n\n1 Introduction\n\nExtracting individual sound sources from an additive mixture of different signals has been attractive to many researchers in computational auditory scene analysis (CASA) [1] and independent component analysis (ICA) [2]. In order to formulate the problem, we assume that the observed signal y^t is an addition of P independent source signals,\n\ny^t = λ_1 x_1^t + λ_2 x_2^t + ... + λ_P x_P^t,   (1)\n\nwhere x_i^t is the t-th sampled value of the i-th source signal, and λ_i is the gain of each source, which is fixed over time. Note that superscripts indicate sample indices of time-varying signals and subscripts indicate the source identification. The gain constants are affected by several factors, such as powers, locations, directions and many other characteristics of the source generators, as well as the sensitivities of the sensors. It is convenient to assume all the sources to have zero mean and unit variance. The goal is to recover all x_i^t given only a single sensor input y^t. The problem is too ill-conditioned to be mathematically tractable, since the number of unknowns is P·T + P given only T observations. Several earlier attempts [3, 4, 5, 6] at this problem have been proposed based on the presumed properties of the individual sounds in the frequency domain.\n\nICA is a data-driven method which relaxes the strong characteristic frequency structure assumptions. 
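As a concrete illustration of the mixing model in Equation 1, the following minimal sketch (with synthetic stand-in sources and hypothetical gain values, not the paper's actual data) builds a single channel observation from two zero-mean, unit-variance sources:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 8000  # one second of samples at 8 kHz

# Two synthetic zero-mean, unit-variance sources (hypothetical stand-ins
# for the music/speech signals used in the paper).
x1 = rng.laplace(size=T)
x1 = (x1 - x1.mean()) / x1.std()
x2 = rng.laplace(size=T)
x2 = (x2 - x2.mean()) / x2.std()

lam1, lam2 = 0.6, 0.4          # fixed gains lambda_i (assumed values)
y = lam1 * x1 + lam2 * x2      # Equation 1: the single channel observation
```

Recovering x1, x2 and the gains from y alone is exactly the ill-conditioned problem described above: 2T + 2 unknowns against T observations.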
Figure 1: Generative models for the observed mixture and original source signals. (A) A single channel observation is generated by a weighted sum of two source signals with different characteristics. (B) Individual source signals are generated by weighted (s_ik) linear superpositions of basis functions (a_ik). (C) Examples of the actual coefficient distributions (shown with exponents q = 0.99, 0.52, 0.26, 0.12). They generally have more sharpened summits and longer tails than a Gaussian distribution, and would be classified as super-Gaussian. The distributions are modeled by generalized Gaussian density functions of the form p(s) ∝ exp(-|s|^q), which provide good matches to the non-Gaussian distributions by varying exponents. From left to right, the exponent decreases, and the distribution becomes more super-Gaussian.\n\nHowever, ICA algorithms perform best when the number of observed signals is greater than or equal to the number of sources [2]. Although some recent overcomplete representations may relax this assumption, the problem of separating sources from a single channel observation remains difficult. ICA has been shown to be highly effective in other aspects such as encoding speech signals [7] and natural sounds [8]. 
The basis functions and the coefficients learned by ICA constitute an efficient representation of the given time-ordered sequences of a sound source by estimating the maximum likelihood densities, thus reflecting the statistical structures of the sources.\n\nThe method presented in this paper aims at exploiting the ICA basis functions for separating mixed sources from a single channel observation. Sets of basis functions are learned a priori from a training data set, and these sets are used to separate the unknown test sound sources. The algorithm recovers the original auditory streams in a number of gradient-ascent adaptation steps maximizing the log-likelihood of the separated signals, calculated using the basis functions and the probability density functions (pdf\u2019s) of their coefficients, the outputs of the ICA basis filters. The objective function not only makes use of the ICA basis functions as a strong prior for the source characteristics, but also of their associated coefficient pdf\u2019s modeled by generalized Gaussian distributions [9]. Experiments on separating two different sources were quite successful for simulated mixtures of rock and jazz music, and of male and female speech signals.\n\n2 Generative Models for Mixture and Source Signals\n\nThe algorithm first involves learning the time-domain basis functions of the sound sources that we are interested in separating from a given training database. This corresponds to the prior information necessary to successfully separate the signals. We assume two different types of generative models, one for the observed single channel mixture and one for the original sources. The first is depicted in Figure 1-A. As described in Equation 1, at every time point t the observed instance is assumed to be a weighted sum of different sources. In our approach only the case of P = 2 is regarded. 
This corresponds to the situation defined in Section 1 in that two different signals are mixed and observed in a single sensor.\n\nFor the individual source signals, we adopt a decomposition-based approach as another generative model. This approach was employed formerly in analyzing sound sources [7, 8] by expressing a fixed-length segment drawn from a time-varying signal as a linear superposition of a number of elementary patterns, called basis functions, with scalar multiples (Figure 1-B). Continuous samples of length N, with N ≪ T, are chopped out of a source, and the subsequent segment is denoted as an N-dimensional column vector in a boldface letter, x_i^t = [x_i^t, x_i^{t+1}, ..., x_i^{t+N-1}]', attaching the lead-off sample index as the superscript and representing the transpose operator with '. The constructed column vector is then expressed as a linear combination of the basis functions such that\n\nx_i^t = Σ_{k=1..M} a_ik s_ik^t = A_i s_i^t,   (2)\n\nwhere M is the number of basis functions, a_ik is the k-th basis function of source i in the form of an N-dimensional column vector, s_ik^t its coefficient (weight), and s_i^t = [s_i1^t, ..., s_iM^t]'. The r.h.s. is the matrix-vector notation; the second subscript k, following the source index i, represents the component number of the coefficient vector s_i^t. We assume that A_i has full rank so that the transforms between x_i^t and s_i^t are reversible in both directions. The inverse of the basis matrix, W_i = A_i^{-1}, refers to the ICA filters that generate the coefficient vector: s_i^t = W_i x_i^t. The purpose of this decomposition is to model the multivariate distribution of x_i^t in a statistically efficient manner. The ICA learning algorithm is equivalent to searching for the linear transformation that makes the components as statistically independent as possible, as well as maximizing the marginal densities of the transformed coordinates for the given training data [10],\n\nA_i* = argmax_{A_i} Π_t p(x_i^t | A_i) = argmax_{A_i} Π_t Π_k p(s_ik^t),   (3)\n\nwhere p(s) denotes the probability of the value of a variable s. Independence between the components and over time samples factorizes the joint probabilities of the coefficients into the product of the marginal ones. What matters is therefore how well matched the model distribution is to the true underlying distribution of the coefficients. The coefficient histogram of real data reveals that the distribution has a highly sharpened point at the peak with a long tail (Figure 1-C). Therefore we use a generalized Gaussian prior [9], which provides an accurate estimate for symmetric non-Gaussian distributions by fitting the exponent q of its parameter set θ = {q, μ, σ}; in its simplest form,\n\np(s | θ) ∝ exp( -| (s - μ) / σ |^q ),   (4)\n\nwhere p(s | θ) is a realized pdf of variable s and should be noted distinctively from the true marginal pdf p(s). 
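A minimal numerical sketch of this decomposition and prior, under stated assumptions: a random invertible matrix A stands in for a learned ICA basis A_i, and the generalized Gaussian density is used in unnormalized form with an arbitrarily chosen exponent q (the paper fits q per coefficient):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 64  # segment length, matching the paper's 8 ms frames at 8 kHz

# Hypothetical basis matrix A (learned by ICA in the paper; random here).
A = rng.standard_normal((N, N))
W = np.linalg.inv(A)          # ICA filters: the inverse of the basis matrix

x = A @ rng.laplace(size=N)   # a segment generated as x = A s (Equation 2)
s = W @ x                     # recover the coefficient vector s = W x

def gg_logpdf(s, q=0.5, sigma=1.0):
    """Unnormalized generalized Gaussian log-density, -|s/sigma|^q.
    q = 2 gives a Gaussian shape; q < 2 is super-Gaussian (Figure 1-C)."""
    return -np.abs(s / sigma) ** q

# Independence factorizes the segment likelihood over components (Eq. 3):
log_likelihood = gg_logpdf(s).sum()
```

The sum over `gg_logpdf(s)` is the factorized log-likelihood that the separation algorithm in Section 3 ascends.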
With the generalized Gaussian ICA learning algorithm [9], the basis functions and their individual parameter sets θ_ik are obtained beforehand and used as prior information for the following source separation algorithm.\n\n3 Separation Algorithm\n\nThe method is motivated by the pdf approximation property of the ICA transformation (Equation 3). The probability of the source signals is computed from the generalized Gaussian parameters in the transformed domain, and the method performs maximum a posteriori (MAP) estimation in a number of adaptation steps on the source signals to maximize the data likelihood. The scaling factors of the generative model are learned as well.\n\n3.1 MAP Estimation of Source Signals\n\nWe have demonstrated that the learned basis filters maximize the likelihood of the given data. Suppose we know what kind of sound sources have been mixed and we are given the sets of basis filters from a training set. Could we infer the learning data? The answer is generally \u201cno\u201d when y^t alone constitutes the learning data and no other information is given. In our problem of single channel separation, half of the solution is already given by the constraint y^t = λ_1 x_1^t + λ_2 x_2^t (Figure 1-B). Essentially, the goal of the source inferring algorithm presented in this paper is to complement the remaining half with the statistical information given by the sets of coefficient density parameters θ_ik. If the model parameters are given, we can perform MAP estimation simply by optimizing the data likelihood computed from the model parameters.\n\nAt every time point, a segment x_i^t generates the independent coefficient vector s_i^t = W_i x_i^t. The likelihood of x_i^t is\n\np(x_i^t | W_i, Θ_i) = Π_{k=1..M} p(s_ik^t | θ_ik),   (5)\n\nwhere p(s_ik^t | θ_ik) is the generalized Gaussian density function and Θ_i = {θ_i1 ... θ_iM} is the parameter group of all the coefficients, with the notation \u2018{u_m ... u_n}\u2019 meaning an ordered set of the elements from index m to n. Assuming independence over time, the probability of the whole signal x_i = {x_i^1 ... x_i^T} is obtained from the marginal ones of all the possible segments,\n\np(x_i | W_i, Θ_i) = Π_{t=1..T} p(x_i^t | W_i, Θ_i).   (6)\n\nThe objective function to be maximized is the multiplication of the data likelihoods of both sound sources, and we denote its log by L:\n\nL = log [ p(x_1 | W_1, Θ_1) p(x_2 | W_2, Θ_2) ].   (7)\n\nOur interest is in adapting x_i^t, for t in {1 ... T}, toward the maximum of L. We introduce a new variable z_i^t = λ_i x_i^t, a scaled value of x_i^t with the contribution factor, and the adaptation is done on the values of z_i^t in order to infer the sound sources and their contribution factors simultaneously. The learning rule is derived in a gradient-ascent manner by summing up the gradients of all the segments where the sample lies:\n\nΔz_1^t ∝ ∂L/∂z_1^t = Σ_{segments τ containing t} [ (1/λ_1) W_1' φ_1(s_1^τ) - (1/λ_2) W_2' φ_2(s_2^τ) ]_{t-τ},  where φ_i(s) = ∂ log p(s | θ_i)/∂s,   (8)\n\nwhich is derived by the chain rule and by the fact that z_2^t = y^t - z_1^t. Note that the gradient of L for either z_1^t or z_2^t always keeps the condition z_1^t + z_2^t = y^t satisfied, so a learning rule on either one subsumes the counterpart. The overall process of the proposed method is summarized as 4 steps in Figure 2; the figure shows one iteration of the adaptation of each sample.\n\nFigure 2: The overall structure of the proposed method. We are given single channel data y^t, and we have the estimates of the source signals x_i^t at every adaptation step. (A) At each time point, the current estimates of the source signals are passed through the basis filters W_i, generating sparse codes s_ik^t that are statistically independent. (B) The stochastic gradient for each code is obtained by taking the derivative of the log-likelihood. (C) The gradient is transformed to the source domain. (D) The individual gradients are combined to be added to the current estimates of the source signals. 
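The four steps of Figure 2 can be sketched as a toy gradient-ascent loop. This is a simplified illustration under several assumptions (a single segment per source instead of all overlapping segments, fixed and equal contribution factors, orthonormal stand-in filter matrices, and a Laplacian prior, i.e. q = 1), not the paper's full algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 64
# Hypothetical stand-ins for the learned ICA filter matrices W_i.
W1 = np.linalg.qr(rng.standard_normal((N, N)))[0]
W2 = np.linalg.qr(rng.standard_normal((N, N)))[0]

lam1 = lam2 = 0.5
y = lam1 * rng.laplace(size=N) + lam2 * rng.laplace(size=N)  # mixture segment

def score(s, q=1.0):
    # d/ds of the unnormalized log-density -|s|^q (the stochastic gradient, step B)
    return -q * np.sign(s) * np.abs(s) ** (q - 1.0)

z1 = 0.5 * y                 # initialize the scaled source estimate with the mixture
eta = 1e-3                   # step size (assumed)
for _ in range(200):
    z2 = y - z1                          # the constraint z1 + z2 = y
    s1 = W1 @ (z1 / lam1)                # (A) pass estimates through the filters
    s2 = W2 @ (z2 / lam2)
    g1 = W1.T @ score(s1) / lam1         # (B)+(C) gradients back in the source domain
    g2 = W2.T @ score(s2) / lam2
    z1 = z1 + eta * (g1 - g2)            # (D) combined gradient-ascent update
```

Because z2 is recomputed as y - z1 at every step, the mixture constraint of Equation 1 holds throughout the adaptation.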
3.2 Estimating λ_i\n\nUpdating the contribution factors λ_i can be accomplished by simply finding the maximum a posteriori values. To simplify the inferring steps, we force the sum of the factors to be constant, e.g. λ_1 + λ_2 = 1. Then λ_2 is completely dependent on λ_1, and we need to consider λ_1 only. Given the basis functions W_i and the current estimates of the sources x_i, the posterior probability of λ_1 is\n\np(λ_1 | x_1, x_2, W_1, W_2) ∝ p(x_1, x_2 | λ_1, W_1, W_2) p(λ_1),   (9)\n\nwhere p(λ_1) is the prior density function of λ_1. Assuming that λ_1 is uniformly distributed, the value of λ_1 maximizing the posterior probability also maximizes its log,\n\nλ_1* = argmax_{λ_1} log p(x_1, x_2 | λ_1, W_1, W_2) = argmax_{λ_1} L,   (10)\n\nwhere L is the log-likelihood of the estimated sources defined in Equation 7. The derivative of L with respect to λ_1 is obtained by the chain rule through the coefficients s_ik^t and the constraint λ_2 = 1 - λ_1 (Equations 11\u201312); collecting the gradient statistics of each source into d_1 and d_2, solving ∂L/∂λ_1 = 0 gives\n\nλ_1 = √|d_1| / (√|d_1| + √|d_2|),   λ_2 = √|d_2| / (√|d_1| + √|d_2|).   (13)\n\nThese values guarantee the local maxima of L w.r.t. the current estimates of the source signals. The algorithm updates the contribution factors periodically during the learning steps.\n\nFigure 3: Waveforms of the four sound sources ((a) rock music, (b) jazz music, (c) male speech, (d) female speech), examples of the learned basis functions (5 were chosen out of 64), and the corresponding coefficient distributions modeled by generalized Gaussians. The full set of basis functions is available at the website also.\n\n4 Experiments and Discussion\n\nFigure 4: Average powerspectra of the 4 sound sources. The frequency scale ranges over 0\u20134kHz (x-axis), since all the signals are sampled at 8kHz; the powerspectra are averaged and represented on the y-axis. 
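Average power spectra like those summarized in Figure 4 can be computed with a short FFT average. A minimal sketch assuming 64-sample frames of an 8 kHz signal (the frequency axis then spans 0 to 4 kHz); this is illustrative, not the paper's exact analysis code:

```python
import numpy as np

def average_power_spectrum(signal, frame_len=64, fs=8000):
    """Mean squared-magnitude rFFT over non-overlapping frames."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)  # 0 .. fs/2 in Hz
    return freqs, power.mean(axis=0)

rng = np.random.default_rng(3)
freqs, spec = average_power_spectrum(rng.standard_normal(8000))
```

With these parameters the spectrum has 33 bins spaced 125 Hz apart, from 0 Hz up to the 4 kHz Nyquist frequency.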
We have tested the performance of the proposed method on single channel mixtures of four different sound types. They were monaural signals of rock and jazz music, and male and female speech. We used different sets of speech signals for learning the basis functions and for generating the mixtures. For the mixture generation, two sentences of the target speakers \u2018mcpm0\u2019 and \u2018fdaw0\u2019, one for each, were selected from the TIMIT speech database. The training set consisted of 21 sentences for each gender: 3 for each of 7 randomly chosen males (or females) from the same database, excluding the 2 target speakers. Rock music was mainly composed of guitar and drum sounds, and jazz was generated by a wind instrument. Vocal parts of both music sounds were excluded. All signals were downsampled to 8kHz from the original 44.1kHz (music) and 16kHz (speech) data. The training data were segmented in 64 samples (8ms) starting at every sample. Audio files for all the experiments are accessible at the website1.\n\n1 http://speech.kaist.ac.kr/~jangbal/ch1bss/\n\nFigure 3 displays the actual sources, the adapted basis functions, and their coefficient distributions. Music basis functions exhibit consistent amplitudes with harmonics, and the speech basis functions are similar to Gabor wavelets. Figure 4 compares the 4 sources by their average spectra. Each covers all the frequency bands, although they are different in amplitude. One might expect that simple filtering or masking cannot separate the mixed sources clearly.\n\nBefore the actual separation, the source signals were initialized to the values of the mixture signal, and the initial λ_i were all equal, to satisfy Equation 1. The adaptation was repeated more than 300 steps on each sample, and the scaling factors were updated every 10 steps. Table 1 reports the signal-to-noise ratios (SNRs) of the mixed signal (y) and the recovered results (ẑ_i) with respect to the original sources (x_i). In terms of total SNR increase, the mixtures containing music were recovered more cleanly than the male-female mixture. Separation of jazz music and male speech was the best, and the waveforms are illustrated in Figure 5.\n\nTable 1: SNR results. R, J, M, F stand for rock, jazz music, male, and female speech. All the values are measured in dB. \u2018Mix\u2019 columns list the sources that are mixed, and the \u2018snr\u2019 columns are the calculated SNRs of the mixed signal (y) and of the recovered sources (ẑ_1, ẑ_2) with respect to the original sources (x_1, x_2).\n\nMix | snr(y, x_1) | snr(ẑ_1) | snr(y, x_2) | snr(ẑ_2) | Total inc.\nR + J | -3.7 | 3.3 | 3.7 | 7.0 | 10.3\nR + M | -3.7 | 3.1 | 3.7 | 6.8 | 9.9\nR + F | -3.9 | 2.2 | 3.9 | 6.1 | 8.3\nJ + M | 0.1 | 5.6 | -0.1 | 5.5 | 11.1\nJ + F | -0.1 | 5.1 | 0.1 | 5.3 | 10.4\nM + F | -0.2 | 2.5 | 0.2 | 2.7 | 5.2\n\nFigure 5: Separation result for the mixture of jazz music and male speech. In the vertical order: original sources (z_1 and z_2), mixed signal (z_1 + z_2), and the recovered signals (ẑ_1 and ẑ_2).\n\nWe conjecture from the average spectra of the sources in Figure 4 that although there exists plenty of overlap between jazz and speech, the structures are dissimilar, i.e. the frequency components of jazz change less, so we were able to obtain relatively good SNR results. However, rock music exhibits a scattered spectrum and less characteristic structure, which explains the relatively poorer performance on rock mixtures.\n\nIt is very difficult to compare a separation method with other CASA techniques, because their approaches differ in so many ways that an optimal tuning of their parameters would be beyond the scope of this paper. However, we compared our method with Wiener filtering [4], which provides optimal masking filters in the frequency domain if the true spectrogram is given. So, we assumed that the other source was completely known. The filters were computed for every block of 8 ms (64 samples), 0.5 sec, and 1.0 sec. In this case, our blind results were comparable in SNR with the results obtained when the Wiener filters were computed at 0.5 sec.\n\nIn summary, our method has several advantages over traditional approaches to signal separation, which involve either spectral techniques [5, 6] or time-domain nonlinear filtering techniques [3, 4]. Spectral techniques assume that the sources are disjoint in the spectrogram, which frequently results in audible distortions of the signal in the regions where the assumption mismatches. Recent time-domain filtering techniques are based on splitting the whole signal space into several disjoint subspaces. Although they overcome the limits of the spectral representation, they consider second-order statistics only, such as autocorrelation, which restricts the separable cases to orthogonal subspaces [4].\n\nOur method avoids these strong assumptions by utilizing a prior set of basis functions that captures the inherent statistical structures of the source signals. This generative model therefore makes use of spectral and temporal structures at the same time. The constraints are dictated by the ICA algorithm that forces the basis functions to result in an efficient representation, i.e. linearly independent source coefficients; and both the basis functions and their corresponding pdf\u2019s are key to obtaining a faithful MAP-based inference algorithm. An important question is how well the training data has to match the test data. We have also performed experiments with the sets of basis functions learned from the test sounds, and the SNR decreased on average by 1dB.\n\n5 Conclusions\n\nWe presented a technique for single channel source separation utilizing time-domain ICA basis functions. Instead of traditional prior knowledge of the sources, we exploited the statistical structures of the sources that are inherently captured by the basis and its coefficients from a training set. The algorithm recovers the original sound streams through gradient-ascent adaptation steps pursuing the maximum likelihood estimate, constrained by the parameters of the basis filters and the generalized Gaussian distributions of the filter coefficients. With the separation results, we demonstrated that the proposed method is applicable to real world problems such as blind source separation, denoising, and restoration of corrupted or lost data. Our current research includes the extension of this framework to perform model comparison to estimate which set of basis functions to use given a dictionary of basis functions. 
This is achieved by applying a variational Bayes method to compare different basis function models and select the most likely source. This method also allows us to cope with other unknown parameters, such as the number of sources. Future work will address the optimization of the learning rules towards real-time processing and the evaluation of this methodology with speech recognition tasks in noisy environments, such as the AURORA database.\n\nReferences\n\n[1] G. J. Brown and M. Cooke, \u201cComputational auditory scene analysis,\u201d Computer Speech and Language, vol. 8, no. 4, pp. 297\u2013336, 1994.\n\n[2] P. Comon, \u201cIndependent component analysis, a new concept?,\u201d Signal Processing, vol. 36, pp. 287\u2013314, 1994.\n\n[3] E. Wan and A. T. Nelson, \u201cNeural dual extended Kalman filtering: Applications in speech enhancement and monaural blind signal separation,\u201d in Proc. of IEEE Workshop on Neural Networks and Signal Processing, 1997.\n\n[4] J. Hopgood and P. Rayner, \u201cSingle channel signal separation using linear time-varying filters: Separability of non-stationary stochastic signals,\u201d in Proc. ICASSP, vol. 3, (Phoenix, Arizona), pp. 1449\u20131452, March 1999.\n\n[5] S. T. Roweis, \u201cOne microphone source separation,\u201d Advances in Neural Information Processing Systems, vol. 13, pp. 793\u2013799, 2001.\n\n[6] S. Rickard, R. Balan, and J. Rosca, \u201cReal-time time-frequency based blind source separation,\u201d in Proc. of International Conference on Independent Component Analysis and Signal Separation (ICA2001), (San Diego, CA), pp. 651\u2013656, December 2001.\n\n[7] T.-W. Lee and G.-J. Jang, \u201cThe statistical structures of male and female speech signals,\u201d in Proc. ICASSP, (Salt Lake City, Utah), May 2001.\n\n[8] A. J. Bell and T. J. Sejnowski, \u201cLearning the higher-order structures of a natural sound,\u201d Network: Computation in Neural Systems, vol. 7, pp. 
261\u2013266, July 1996.\n\n[9] T.-W. Lee and M. S. Lewicki, \u201cThe generalized Gaussian mixture model using ICA,\u201d\nin International Workshop on Independent Component Analysis (ICA\u201900), (Helsinki,\nFinland), pp. 239\u2013244, June 2000.\n\n[10] B. Pearlmutter and L. Parra, \u201cA context-sensitive generalization of ICA,\u201d in Proc.\n\nICONIP, (Hong Kong), pp. 151\u2013157, September 1996.\n\n\f", "award": [], "sourceid": 2224, "authors": [{"given_name": "Gil-jin", "family_name": "Jang", "institution": null}, {"given_name": "Te-Won", "family_name": "Lee", "institution": null}]}