Part of Advances in Neural Information Processing Systems 15 (NIPS 2002)
Hagai Attias
Source separation is an important problem at the intersection of several fields, including machine learning, signal processing, and speech tech- nology. Here we describe new separation algorithms which are based on probabilistic graphical models with latent variables. In contrast with existing methods, these algorithms exploit detailed models to describe source properties. They also use subband filtering ideas to model the reverberant environment, and employ an explicit model for background and sensor noise. We leverage variational techniques to keep the compu- tational complexity per EM iteration linear in the number of frames.
1 The Source Separation Problem
Fig. 1 illustrates the problem of source separation with a sensor array. In this problem, signals from K independent sources are received by each of L (cid:21) K sensors. The task is to extract the sources from the sensor signals. It is a difficult task, partly because the received signals are distorted versions of the originals. There are two types of distortions. The first type arises from propagation through a medium, and is approximately linear but also history dependent. This type is usually termed reverberations. The second type arises from background noise and sensor noise, which are assumed additive. Hence, the actual task is to obtain an optimal estimate of the sources from data. The task is difficult for another reason, which is lack of advance knowledge of the properties of the sources, the propagation medium, and the noises. This difficulty gave rise to adaptive source separation algorithms, where parameters that are related to those properties are adjusted to optimized a chosen cost function.
Unfortunately, the intense activity this problem has attracted over the last several years [1–9] has not yet produced a satisfactory solution. In our opinion, the reason is that existing tech- niques fail to address three major factors. The first is noise robustness: algorithms typically ignore background and sensor noise, sometime assuming they may be treated as additional sources. It seems plausible that to produce a noise robust algorithm, noise signals and their properties must be modeled explicitly, and these models should be exploited to compute optimal source estimators. The second factor is mixing filters: algorithms typically seek, and directly optimize, a transformation that would unmix the sources. However, in many situations, the filters describing medium propagation are non-invertible, or have an unstable inverse, or have a stable inverse that is extremely long. It may hence be advantageous to
Figure 1: The source separation problem. Signals from K = 2 speakers propagate toward L = 2 sensors. Each sensor receives a linear mixture of the speaker signals, distorted by multipath propagation, medium response, and background and sensor noise. The task is to infer the original signals from sensor data.
estimate the mixing filters themselves, then use them to estimate the sources. The third factor is source properties: algorithms typically use a very simple source model (e.g., a one time point histogram). But in many cases one may easily obtain detailed models of the source signals. This is particularly true for speech sources, where large datasets exist and much modeling expertise has developed over decades of research. Separation of speakers is also one of the major potential commercial applications of source separation algorithms. It seems plausible that incorporating strong source models could improve performance. Such models may potentially have two more advantages: first, they could help limit the range of possible mixing filters by constraining the optimization problem. Second, they could help avoid whitening the extracted signals by effectively limiting their spectral range to the range characteristic of the source model.
This paper makes several contributions to the problem of real world source separation. In the following, we present new separation algorithms that are the first to address all three factors. We work in the framework of probabilistic graphical models. This framework allows us to construct models for sources and for noise, combine them with the reverberant mixing transformation in a principled manner, and compute parameter and source estimates from data which are Bayes optimal. We identify three technical ideas that are key to our approach: (1) a strong speech model, (2) subband filtering, and (3) variational EM.
2 Frames, Subband Signals, and Subband Filtering
We start with the concept of subband filtering. This is also a good point to define our notation. Let xm denote a time domain signal, e.g., the value of a sound pressure waveform at time point m = 0; 1; 2; :::. Let Xn[k] denote the corresponding subband signal at time frame n and subband frequency k. The subband signals are obtained from the time domain signal by imposing an N-point window wm, m = 0 : N (cid:0) 1 on that signal at equally spaced points nJ, n = 0; 1; 2; :::, and FFT-ing the windowed signal,