{"title": "Latent Gaussian Activity Propagation: Using Smoothness and Structure to Separate and Localize Sounds in Large Noisy Environments", "book": "Advances in Neural Information Processing Systems", "page_first": 3465, "page_last": 3474, "abstract": "We present an approach for simultaneously separating and localizing\nmultiple sound sources using recorded microphone data. Inspired by topic\nmodels, our approach is based on a probabilistic model of inter-microphone\nphase differences, and poses separation and localization as a Bayesian\ninference problem. We assume sound activity is locally smooth across time,\nfrequency, and location, and use the known position of the microphones to\nobtain a consistent separation. We compare the performance of our method\nagainst existing algorithms on simulated anechoic voice data and find that it\nobtains high performance across a variety of input conditions.", "full_text": "Latent Gaussian Activity Propagation: Using\n\nSmoothness and Structure to Separate and Localize\n\nSounds in Large Noisy Environments\n\nDaniel D. Johnson, Daniel Gorelik, Ross Mawhorter, Kyle Suver, Weiqing Gu\n\nDepartment of Mathematics\n\nHarvey Mudd College\nClaremont, CA 91711\n\n{ddjohnson, dgorelik, rmawhorter, ksuver, gu}@hmc.edu\n\nSteven Xing, Cody Gabriel, Peter Sankhagowit\n\nIntel Corporation\n\nHillsboro, OR 97124\n\n{steven.xing, cody.gabriel, peter.sankhagowit}@intel.com\n\nAbstract\n\nWe present an approach for simultaneously separating and localizing multiple\nsound sources using recorded microphone data. Inspired by topic models, our\napproach is based on a probabilistic model of inter-microphone phase differences,\nand poses separation and localization as a Bayesian inference problem. We assume\nsound activity is locally smooth across time, frequency, and location, and use\nthe known position of the microphones to obtain a consistent separation. 
We\ncompare the performance of our method against existing algorithms on simulated\nanechoic voice data and \ufb01nd that it obtains high performance across a variety of\ninput conditions.\n\n1\n\nIntroduction\n\nTraditional playback of real-world events is usually constrained to the viewpoints of the original\ncameras and microphones, which limits the immersion of the experience. In contrast, if those events\nare reconstructed in virtual space, they can be played back from perspectives without a corresponding\nsource recording and explored interactively. Such technology would enable users to experience events\nin virtual reality with full freedom of motion.\nRealistically reproducing the audio of a physical event in virtual space requires that sounds be both\nfaithfully separated from each other and accurately localized. Furthermore, in order to capture\nevents that happen over a large region in space, it is necessary to place microphones far away from\neach other, which introduces non-negligible delays into the audio signals. As such, we focus on\nthe task of performing blind separation and localization in the presence of noise, where relatively\nfew microphones are placed far away from each other to cover a large area. 
This situation poses\ndif\ufb01culties for some existing separation and localization algorithms, which often assume large\nnumbers of microphones [Dorfan et al., 2015], an array of closely spaced microphones [Mandel et al.,\n2009, Lewis, 2012], or known sound characteristics [Old\ufb01eld et al., 2015, Wang and Chen, 2017].\nSpeci\ufb01cally, we examine the situation of sound sources in a large, known, anechoic two-dimensional\nspace, with pairs of two omni-directional microphones placed at arbitrary points and orientations.\nThe presence of noise, as well as the potentially large distances between microphone pairs, means\nthat sources may not perfectly correspond across recordings, making separation an ill-posed problem.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fWe propose a method for simultaneously separating and localizing sounds by performing Bayesian\nmaximum a posteriori inference in an approximate probabilistic model of sound propagation. We\nmodel the phase differences between the microphones in each pair as arising from a latent source\nactivity array. A Gaussian process prior is used to ensure that it is locally smooth across time,\nfrequency, and spatial extent, and microphone locations are handled using a set of propagation\ntransformations. This approach is capable of separating sounds in a noisy environment, regardless of\nthe number of sources present, and localizing each source to a subset of a grid of possible locations.\nOur algorithm makes the window-disjoint orthogonality assumption (W-DO), i.e., that the time-\nfrequency representations of the sound sources do not overlap [Jourjine et al., 2000]. This allows us\nto model each element of the time-frequency spectrum as being generated by exactly one source at a\nspeci\ufb01c location. 
In particular, we model the phase difference as arising from a mixture of von Mises distributions, with each distribution corresponding to the phase differences for an individual location. We evaluate our method on synthetic data composed of multiple speakers located in a large space with added white noise, and compare the results against those achieved by MESSL [Mandel et al., 2009] and DALAS [Dorfan et al., 2015] for variable numbers of sources and microphone configurations.

1.1 Background Information

Our approach – along with many other source separation algorithms – uses the Short-Time Fourier Transform (STFT) representation to analyze the different sounds present in a signal. The STFT is the result of applying the Fourier Transform to short overlapping time windows, and gives a representation of which frequencies are present in a signal during those windows.

Because sound propagates at a finite speed, each source arrives at each microphone with a delay. This produces a phase difference between the sounds recorded by the microphones in each pair. Because the microphones in each pair are sufficiently close, the phase delay for a given source frequency at a given time is the multiplicative factor $e^{i\omega\delta}$, where $\omega$ is the sound frequency and $\delta$ is the arrival delay between the two microphones [Rickard, 2007].

The mapping from delay to phase difference is invertible when the distance between the pair of microphones is bounded by $\pi c / \omega_m$, where $\omega_m$ is the maximum source frequency and $c$ is the speed of sound [Rickard, 2007]. This gives an estimate of the true location of the active source. For larger separations and higher frequencies, multiple delays correspond to the same phase difference, making the correct location ambiguous and requiring additional assumptions to correctly separate sounds on the basis of phase differences.

1.2 Prior Work

Blind source separation is a well-studied problem.
However, many existing techniques (such as Non-\nnegative Matrix Factorization [Virtanen, 2007], Beamforming [Lewis, 2012], Independent Component\nAnalysis [Hyv\u00e4rinen et al., 2004, Bell and Sejnowski, 1995, Yeredor, 2001], deep-learning based\napproaches [Wang and Chen, 2017], and a heuristic sound-speci\ufb01c approach by Old\ufb01eld et al. [2015])\nmake strong assumptions about the spectral structure of the sources, geometry of the microphones,\nand level of noise, which make them unsuitable for separating our mixtures of interest.\nThe Degenerate Unmixing Estimation Technique (DUET) focuses on separating degenerate mixtures\nof sparse sounds that occur across space from two recordings [Rickard, 2007]. It assumes that the\nsounds do not overlap signi\ufb01cantly in the STFT time-frequency representation, so that different\nsources are dominant at different frequencies at each moment in time. This assumption is known\nas window-disjoint orthogonality (W-DO). DUET uses the phase differences in the STFT across\nmicrophones to estimate the time difference of arrival between the two microphones for that source,\nwhich can be used to approximate the direction from which the sound arrived. It then clusters\nthese directions to construct a series of masks that can isolate a large number of spatially separated\nsounds given only two side-by-side recordings. However, this approach is limited in that it assumes a\nnoiseless environment, and can only accurately estimate time differences for low frequencies due to\nthe aliasing effect of high-frequency waveforms.\nThe Model-based EM Source Separation and Localization (MESSL) algorithm, proposed by Mandel\net al. 
[2009], builds upon DUET by using a probabilistic model that predicts phase differences given true source delays, and separates sounds by performing maximum-likelihood estimation of

Figure 1: Directed factor graph diagram [Dietz, 2010] of the LGAP model. Circles represent random variables, diamonds represent deterministic functions of parent variables, and unmarked symbols denote hyperparameters. Along edges, small black squares represent directed factors (i.e., conditional probability distributions of children given parents), small white diamonds represent function application, and the small black circle represents a choice "gate" (see Dietz [2010]). Plate notation indicates sets of independent variables, and array-valued variables are labeled below with their index sets. N denotes a multivariate normal distribution, Cat denotes a categorical (multinomial) distribution, and vM/U represents the von Mises-Uniform mixture described in Section 2.3.

the delay parameters. This approach avoids the aliasing problem, as the mapping from true delay to phase difference is one-to-one. From this premise, MESSL identifies the parts of the spectrogram of the signal that best fit the models constructed of the mixture, using an Expectation-Maximization (EM) method. Although the original algorithm focuses on a single microphone pair and assumes each time-frequency component is independent, MESSL has been extended to incorporate local smoothness using a Markov random field [Mandel and Roman, 2015] and to work with more than one microphone pair [Bagchi et al., 2015].
However, these extensions focus on the use of MESSL\nfor separating speech mixtures in small environments, and were not designed for use in large noisy\nenvironments with distant microphones.\nThe Distributed Algorithm for Localization and Separation (DALAS) extends MESSL to work with\nspatially-separated microphones in a known con\ufb01guration [Dorfan et al., 2015]. The \ufb01rst step in\nthis approach is to run an incremental distributed expectation-maximization (IDEM) algorithm to\n\ufb01nd the maximum likelihood estimate of the location distribution of the sources, using the known\ncon\ufb01guration of microphones to model the possible phase differences associated with each spatial\nposition. Next, it associates each peak of the localization distribution with a source, and matches\neach source to its closest microphone pair. Finally, a spectral mask is created using thresholding and\nthe node values are then \ufb01ltered to give the separated sources. This technique has the advantage of\nworking with spatially separated microphones over a large area. However, it was not designed for\nnoisy environments, and assumes that every time-frequency component is independent, ignoring the\ntemporal and frequential structure of each sound source.\n\n2 Probabilistic Model\n\nWe cast source separation and localization as a Bayesian inference problem. We model each\ntime-frequency bin as being assigned a latent dominating source and location that determines the\ndistribution of each observed phase difference. 
These assignments are in turn drawn proportional to a smooth latent activity array A, which causes the assignments for nearby observations to be correlated. To account for arrival time differences between microphones, this latent activity is corrected for each microphone using a propagation function before being used to determine the dominating locations. Using this model, which we call Latent Gaussian Activity Propagation (LGAP), sounds can then be separated by performing Bayesian inference on the latent source and location assignments.

Let T be a set of time bins, F a set of frequency bins, L a set of candidate locations, and M a set of microphone pairs. A directed factor graph diagram [Dietz, 2010] of the LGAP model is shown in Figure 1. We assume the sound was generated by a small set of sources S, and let A ∈ R^{|T|×|F|×|S|×|L|} be an array-valued random variable such that A_{t,f,s,ℓ} is proportional to the likelihood of hearing a sound from source s at location ℓ, time t, and frequency f. We also assume that the location of each source does not change with time or frequency, and thus express A as a product of time-frequency activity and spatial extent terms, denoted α ∈ R^{|T|×|F|×|S|} and β ∈ R^{|S|×|L|}.

This activity array is propagated through time for each microphone according to location-dependent time delays, resulting in a propagated location-activity array for each microphone. Next, each time-frequency bin of the observed audio is assigned a single dominating source and location, proportional to the propagated activity of each. Finally, the phase difference observations are sampled from a location-specific phase distribution.

2.1 Sound Activity

We encourage the latent activity array to be smooth by imposing Gaussian process priors over the random variables α and β. In particular, we interpret each element of α and β as evaluations of a function over the index sets for those variables, i.e., we interpret α ∈ R^{|T|×|F|×|S|} as a random function α : T × F × S → R such that α(t, f, s) yields the activity of source s at time t and frequency f, and specify the covariance matrix Σ_α for α by evaluating a chosen Gaussian process kernel over the set T × F × S of possible times, frequencies, and sources, yielding the multivariate normal distribution α ∼ N(0, Σ_α). By choosing these kernels to favor smooth functions, we encode our prior belief that the latent activity arrays are smooth over specific time scales, frequency ranges, or spatial areas.

In practice, |T|, |F|, and |L| may be very large, leading to memory and computation issues. To efficiently enforce our smoothness priors over high-dimensional spaces, we choose Gaussian process kernels that factorize into combinations of axis-specific (positive definite) kernel functions, i.e.,

\[ k_\alpha\big((t_1, f_1, s_1), (t_2, f_2, s_2)\big) = k_T(t_1, t_2)\, k_F(f_1, f_2)\, k_{S_1}(s_1, s_2) + k_{S_2}(s_1, s_2), \quad (1) \]
\[ k_\beta\big((s_1, \ell_1), (s_2, \ell_2)\big) = k_L(\ell_1, \ell_2)\, k_{S_1}(s_1, s_2). \quad (2) \]

Computations involving the resulting multivariate normal distributions require factorizing each covariance matrix as Σ = SSᵀ (where S is called a scale matrix) and multiplying vectors by these matrices. Fortunately, the axis-aligned structure of our kernels makes these multiplications efficient.

Proposition 1. Suppose Σ_α ∈ R^{(|T||F||S|)×(|T||F||S|)} and Σ_β ∈ R^{(|S||L|)×(|S||L|)} are covariance matrices arising from the evaluation of kernels (1) and (2) on a grid of points. There exist factorizations Σ_α = S_α S_αᵀ and Σ_β = S_β S_βᵀ such that multiplying vectors by S_α and S_β (and their inverses) can be performed in O(|T||F||S|(|T| + |F| + |S|)) and O(|S||L|(|S| + |L|)) arithmetic operations, respectively.

Specific kernels for k_T, k_F, and k_L can be chosen based on prior knowledge about the application domain. For our experiments, we use rational quadratic kernels with length scales of 0.1 seconds and 1000 Hz for k_T and k_F, respectively, with the α parameter set to 0.1, and use a mixture of two equally-likely radial basis function kernels for k_L, with length scales of 3/32 and 5/8 times the size of our test region.

We assume that individual sources are independent, i.e., we define k_{S_1}(s_1, s_2) = δ_{s_1 s_2} to avoid enforcing any structure across sources. Additionally, we set k_{S_2}(s_1, s_2) = C δ_{s_1 s_2}, where C is chosen to account for the magnitude of differences in average activity between sources. In our experiments, we set C = 3 and similarly scale up the rational quadratic and radial basis function kernels to have maximum value 3, as we found that this produced a plausible prior distribution of output assignments after normalization (described in Section 2.3).

Since we interpret the activity matrix A as being proportional to the contribution of each source and location, we need to ensure that its elements are nonnegative. We thus compute

\[ A_{t,f,s,\ell} = \exp(\alpha_{t,f,s}) \, \frac{\exp(\beta_{s,\ell})}{\sum_{\ell'} \exp(\beta_{s,\ell'})}. \]

Note that the contribution is normalized across locations for each source, which prevents sources that are spread across multiple locations (or have uncertain location estimates during inference) from being proportionally "louder" and dominating the final source assignments.

2.2 Propagation

In a large environment, it is likely that sounds arrive at each microphone pair with a delay greater than the resolution of our STFT time bins.
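The efficient multiplications claimed in Proposition 1 come from the product structure of the kernels: on a grid, the scale matrix of a product kernel is a Kronecker product of per-axis scale matrices, which can be applied by successive mode products. A minimal NumPy sketch of this idea for the product part of kernel (1), with illustrative dimensions (not the paper's implementation):

```python
import numpy as np

def kron_matvec(S_T, S_F, S_S, x):
    """Multiply vec(x) by S_T ⊗ S_F ⊗ S_S without forming the Kronecker product.

    x has shape (|T|, |F|, |S|); cost is O(|T||F||S|(|T|+|F|+|S|))
    instead of O((|T||F||S|)^2) for the dense matrix-vector product.
    """
    x = np.einsum('ij,jfs->ifs', S_T, x)  # mode product along the time axis
    x = np.einsum('ij,tjs->tis', S_F, x)  # mode product along the frequency axis
    x = np.einsum('ij,tfj->tfi', S_S, x)  # mode product along the source axis
    return x

# Check against the explicit Kronecker product on tiny dimensions.
rng = np.random.default_rng(0)
S_T, S_F, S_S = (rng.standard_normal((n, n)) for n in (4, 3, 2))
x = rng.standard_normal((4, 3, 2))
dense = np.kron(np.kron(S_T, S_F), S_S)
assert np.allclose(kron_matvec(S_T, S_F, S_S, x).ravel(), dense @ x.ravel())
```

The inverses can be applied the same way by substituting per-axis inverses (or triangular solves when the per-axis factors come from Cholesky decompositions).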
The activity matrix A must thus be corrected for each microphone pair location, by shifting activity from far locations forward in time so that it is heard after the appropriate delay.

For a fixed location ℓ and microphone pair m, let τ_{m,ℓ} denote the propagation time delay between a sound wave leaving ℓ and arriving at m, and let τ̄_ℓ denote a reference delay for each location. We compute A′_m = f_prop^m(A) as

\[ A'_{m,t_2,f,s,\ell} = \sum_{t_1 \in T} w(t_2 - t_1 - \tau_{m,\ell} + \bar{\tau}_\ell) \, A_{t_1,f,s,\ell}, \]

where w : R → R is a weighting function that peaks at zero. This shifts the values of A along the time axis by τ_{m,ℓ} − τ̄_ℓ bins, the delay (in STFT frames) of microphone m relative to the reference delay for a sound at location ℓ, and also blurs them slightly across bins to account for uncertainty in location and timing or misalignment between discrete timesteps. This operation can be efficiently implemented as a convolution with a precomputed filter. In practice, we let τ̄_ℓ = τ_{m*,ℓ} for some chosen reference microphone pair m*, and choose w based on finite differences of a logistic sigmoid function

\[ w(t) = \begin{cases} \dfrac{1}{1+\exp(-(t+0.5)/\sigma)} - \dfrac{1}{1+\exp(-(t-0.5)/\sigma)} & -3 \le t \le 3, \\ 0 & \text{otherwise}, \end{cases} \]

where σ is chosen based on how close the points in L are to each other (i.e., the delay uncertainty due to our discretization of the region).

Due to the different offsets for each microphone, sounds that arrive at the microphones over a particular time interval may be explained by activity that occurs outside of that interval.
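The shift-and-blur operation described above amounts to a short 1-D convolution along the time axis, with a filter built from finite differences of a logistic sigmoid. A minimal sketch for a single location, with a hypothetical frame delay and σ (not the paper's code):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def propagate(a, delay, sigma=0.25):
    """Shift activity a (one location, shape (T,)) later in time by `delay` frames.

    Implements A'[t2] = sum_t1 w(t2 - t1 - delay) * A[t1], where
    w(t) = logistic((t + 0.5)/sigma) - logistic((t - 0.5)/sigma)
    blurs the shift so fractional delays are handled smoothly.
    """
    T = len(a)
    out = np.zeros(T)
    base = int(round(delay))
    for o in range(base - 3, base + 4):  # offsets where w(o - delay) is non-negligible
        w = logistic((o - delay + 0.5) / sigma) - logistic((o - delay - 0.5) / sigma)
        lo, hi = max(0, o), min(T, T + o)
        out[lo:hi] += w * a[lo - o:hi - o]  # out[t] += w * a[t - o]
    return out

# An impulse at frame 5, delayed by 2.0 frames, moves (mostly) to frame 7,
# and the filter weights sum to (approximately) one.
a = np.zeros(16)
a[5] = 1.0
shifted = propagate(a, delay=2.0)
assert int(np.argmax(shifted)) == 7
assert np.isclose(shifted.sum(), 1.0, atol=1e-3)
```

In the model this filter is precomputed per location and applied to every time series at once, rather than in a Python loop.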
The propagation transformation thus maps our original set of timesteps T for the latent activity to a slightly smaller set of timesteps T′ corresponding to the observed data.

2.3 Assignments and Observations

After propagation, we normalize the activity array A′ across sources and locations, obtaining source-location probabilities

\[ p_{m,t,f,s,\ell} = \frac{A'_{m,t,f,s,\ell}}{\sum_{s' \in S,\, \ell' \in L} A'_{m,t,f,s',\ell'}}. \]

We then sample latent source and location assignments for each microphone pair m, time t, and frequency f, representing the source and location that dominated the recording for that microphone pair at that moment. In particular, we sample s_{m,t,f}, ℓ_{m,t,f} ∼ Cat(p_{m,t,f}), so that

\[ P(s_{m,t,f} = s \wedge \ell_{m,t,f} = \ell) = p_{m,t,f,s,\ell}. \]

Given the location ℓ_{m,t,f} that dominates a given time-frequency bin, we can predict the phase difference φ_{m,t,f} received by a given microphone pair m. Let δ_{m,ℓ} denote the difference between the distance of the first microphone in the pair to ℓ and the distance of the second microphone to ℓ, let Δ_{m,ℓ} denote our uncertainty in the value of δ_{m,ℓ} due to our discrete set of locations, and let c denote the speed of sound. We model φ_{m,t,f} as being drawn from a von Mises distribution vM(µ_{m,f,ℓ}, κ_{m,f,ℓ}) with probability 1 − ε_{m,f,ℓ}, and from U(0, 2π) with probability ε_{m,f,ℓ}, where

\[ \mu_{m,f,\ell} = 2\pi f \, \frac{\delta_{m,\ell}}{c} \bmod 2\pi, \qquad \kappa_{m,f,\ell} = \left(2\pi f \, \frac{\Delta_{m,\ell}}{c}\right)^{-2}, \qquad \epsilon_{m,f,\ell} = 0.001. \]

For efficiency, we group the individual raw STFT bins into small regions, and assume all observations within each region were generated by the same von Mises/Uniform mixture.
This allows our T and F sets to be smaller than the number of true STFT components computed while still using all available phase information.

2.4 Inference

Given a set of known phase observations φ, we can separate and localize the sounds by performing Bayesian inference on the assignments s_{m,t,f} and ℓ_{m,t,f} for each of our data points. We focus on approximating the maximum a posteriori (MAP) estimate of the parameters, i.e., the most likely separation given our observations.

We approximate the MAP solution by marginalizing out s_{m,t,f} and ℓ_{m,t,f} and performing gradient ascent on the log posterior

\[ \log P(\alpha, \beta \mid \phi) = \log P(\alpha) + \log P(\beta) + \log P(\phi \mid \alpha, \beta) - \log P(\phi). \]

Note that, due to our smoothness assumption, the covariance matrices Σ_α and Σ_β are poorly conditioned, with small eigenvalues along directions of rapid oscillation. These values can hamper convergence of gradient ascent, as the log-posterior term contains products with the inverse covariance matrices, leading to steep valleys in the loss landscape. We thus perform gradient ascent in the eigenspace of the covariance matrices, and use a preconditioning matrix to scale the small eigenvalues so that the minimum eigenvalue of the system is 1. We also employ gradient clipping to prevent the algorithm from diverging.
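One way to see why working in a transformed basis helps: in whitened coordinates u with α = S_α u (so Σ_α = S_α S_αᵀ), the ill-conditioned Gaussian prior term becomes the perfectly conditioned −½‖u‖². A small NumPy check of this identity, with illustrative dimensions (not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((6, 6))   # a scale matrix: Sigma = S @ S.T
Sigma = S @ S.T

u = rng.standard_normal(6)        # whitened coordinates
alpha = S @ u                     # original coordinates

# The Gaussian prior term -0.5 * alpha^T Sigma^{-1} alpha ...
direct = -0.5 * alpha @ np.linalg.solve(Sigma, alpha)
# ... equals the isotropic -0.5 * ||u||^2 in whitened coordinates.
whitened = -0.5 * u @ u
assert np.isclose(direct, whitened)
```

In these coordinates the prior contributes a Hessian of −I regardless of how small Σ_α's eigenvalues are; the data term can still be steep, which motivates the additional gradient clipping.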
For computational efficiency, we only consider a random subset of frequencies at each iteration of gradient ascent, since this gives an unbiased estimate of the true gradient.

After obtaining estimates of α and β, we can then estimate the posterior probabilities P(s_{m,t,f}, ℓ_{m,t,f} | α, β, φ) by computing

\[ P(s_{m,t,f}, \ell_{m,t,f} \mid \alpha, \beta, \phi) \propto P(s_{m,t,f}, \ell_{m,t,f} \mid \alpha, \beta) \, P(\phi \mid \ell_{m,t,f}). \]

These probabilities can be used to assign portions of the input to specific sources and locations.

2.5 Garbage Source

Microphone-specific noise, reverberations, and interactions between sources can produce phase observations that are not consistent with any true source location. To enable the model to handle these non-localized components, we follow Mandel et al. [2009] and designate one of the sources s_g ∈ S as a "garbage source". This source is constrained so that, instead of being assigned to a specific 2D coordinate, it is instead assigned to a "garbage location" ℓ_g ∈ L with no associated time delay (δ_{m,ℓ_g} = 0), and whose corresponding phase distribution is uniform over [0, 2π) (so ε_{m,f,ℓ_g} = 1). When using a garbage source, we let β range only over S \ {s_g} and L \ {ℓ_g}, and compute

\[ A_{t,f,s,\ell} = \begin{cases} \exp(\alpha_{t,f,s}) \, \dfrac{\exp(\beta_{s,\ell})}{\sum_{\ell' \neq \ell_g} \exp(\beta_{s,\ell'})} & \ell \neq \ell_g,\ s \neq s_g, \\ \exp(\alpha_{t,f,s}) & \ell = \ell_g,\ s = s_g, \\ 0 & \text{otherwise}, \end{cases} \]

i.e., we constrain s_g to assign all of its activity to ℓ_g and all other sources to assign none to it.

2.6 Initialization

The LGAP model is fairly sensitive to the initialization of the parameters α and β. Following Mandel et al. [2009], we start by computing the PHAT-histogram [Aarabi, 2002] of the input for each microphone, and then identify peaks in the histogram to estimate potential source time delays. In our case, we choose the microphone pair with the most cleanly-identified peaks as m*, and use the peaks for that pair to initialize our source activity array α.
In particular, we compute the likelihood of each phase observation under a set of von Mises distributions centered at each peak (along with a uniform distribution for the garbage source), then smooth these likelihoods over time and frequency using a Gaussian blur, and initialize α from the logarithm of these smoothed likelihoods. This ensures that our initial propagated activity array A′ assigns each of the peaks of the reference microphone to a distinct source. β is initialized uninformatively to an array of zeros.

To optimize our parameters from this initialization, we start by holding α fixed and running gradient ascent on our posterior probabilities with respect to β. This gives us an initial estimate of the location of each of the sources identified from our initialization. Next, we hold β constant, reset α to zero, and run gradient ascent on α to re-estimate our time-frequency masks to be consistent with the identified locations without being biased by the approximate PHAT-histogram masks. Finally, we perform gradient ascent for a longer duration on both α and β to fine-tune our full activity matrix.

Figure 2: Generated masks for one of our test examples, consisting of two sources recorded by five microphone pairs over a 32 m by 32 m square. Top row: separated spectrograms of the recordings from the center microphone pair, over four seconds (horizontal axis) across frequencies up to 22050 Hz (vertical axis). Bottom row: location estimates, where shading represents confidence and crosses represent predicted point locations. For LGAP, location estimates are colored by source, since LGAP estimates a separate location distribution for each.

3 Experiments

We evaluate our algorithm on a set of synthetic anechoic speech mixtures using audio from the Mozilla Common Voice Dataset¹.
For each test, we place sound sources randomly within a square\nregion, and compute the waveforms received by each of the microphones, which are arranged parallel\nto one side of this region with a pair spacing of 5 cm. To prevent unrealistic phase observations due\nto inaudible contributions of sources at high frequencies, and to demonstrate robustness against noise,\nwe add a small amount of white noise to each of our microphones.\nFor our method, we use rational quadratic kernels for our time and frequency kernels and a mixture\nof radial basis functions for our location kernel, as described in Section 2.1. We enable the garbage\nsource to handle the additional microphone noise, and group STFT bins into rectangular regions\nconsisting of 8 timesteps and 2 frequencies for computational ef\ufb01ciency.\nWe compare our method with a number of baselines:\n\u2022 MESSL [Mandel et al., 2009, Mandel and Roman, 2015]: Although MESSL has been extended to\nwork with many-channel recordings [Bagchi et al., 2015], it assumes all microphones are adjacent\nand does not give location estimates. We thus run MESSL individually on each of our microphone\npairs, using MRF smoothing and a garbage source. 
We do not incorporate level differences, as these are not used by the other methods and are intended for use with a dummy head placed between the pairs, which we do not simulate here.

• DALAS [Dorfan et al., 2015]: For consistency with our method, we modify DALAS to use an identical von Mises distribution of phase offsets for each location, and select a fixed number of sources to separate instead of choosing it dynamically.

• Independent tied-microphone model: A simple baseline, based on the DUET algorithm, that assumes that the corresponding time-frequency bin for each microphone pair was generated by a single location ℓ_{m,t,f} = ℓ_{t,f} (i.e., the same source dominates across all microphones, ignoring time delays and noise), and that these locations were independently chosen and uniformly distributed (ignoring smoothness over time and frequency). We calculate the posterior distribution over these locations explicitly as

\[ P(\ell_{t,f} \mid \phi_{1,t,f}, \ldots, \phi_{M,t,f}) \propto P(\ell_{t,f}) \prod_{m=1}^{M} P(\phi_{m,t,f} \mid \ell_{t,f}) \propto \prod_{m=1}^{M} P(\phi_{m,t,f} \mid \ell_{t,f}), \]

where P(φ_{m,t,f} | ℓ_{t,f}) is the same von Mises-Uniform mixture described in Section 2.3. We then choose a fixed number of locations with the highest likelihood averaged across all bins and use those as the sources to extract.

¹https://voice.mozilla.org/

Figure 3: Experimental results for each method: DALAS, MESSL (All mics), MESSL (Best mic), Indep. Tied, LGAP, Ideal Mask. Lines represent mean performance and shaded regions represent one standard deviation, computed across all sources separately for each of five trials. Location squared error is reported relative to the side length of the square region. Region size is in meters. MESSL performance is reported averaged across all microphones as well as for the single microphone pair with the best separation. For reference, Ideal Mask gives the performance of the ideal ratio mask [Srinivasan et al., 2006].

For all methods, we compute the STFT using a Hann window with a frame size of 512 samples and a step of 128 samples (for a 44100 Hz signal). Additionally, for all methods except MESSL, we specify the set L of possible locations as a 40 by 40 grid covering this square region, along with a garbage location. (The DALAS and independent-tied methods do not process this garbage location separately, and instead simply treat it as another possible location.) For the DALAS and independent-tied methods, after selecting our set of sources, we re-normalize the masks across those sources so that mask values for each time-frequency bin sum to one, effectively conditioning on that bin being generated at one of the (fixed number of) most likely source locations. We then divide by the maximum mask value for each source to amplify sources detected with low confidence. See Figure 2 for an example of the resulting masks and location estimates corresponding to each method.

We evaluate the performance of each algorithm on a suite of tests, varying three parameters: the number of microphone pairs (from 1 to 10), the number of sources (from 1 to 8), and the size of the square region (from 4 m to 512 m along each side). We conduct five experiments for each set of parameters, varying each parameter while holding the others at default values (5 microphone pairs, 3 sources, 32 m).

To evaluate the localization performance of each method, we compute the squared error of the estimate, normalized by the size of the square region.
To evaluate separation performance, we compare the masks produced by each algorithm to the ideal ratio mask [Srinivasan et al., 2006], which assigns bins to sources proportionally to the ratio of the power spectral density of each source. Letting R denote the ratio mask, M denote a separation method's mask, and P denote the total power spectral density of the recording, we compute bin precision, bin recall, power precision, and power recall for a source s as

\[ \mathrm{BP}_s = \frac{\sum_{m,t,f} M_{m,t,f,s} R_{m,t,f,s}}{\sum_{m,t,f} M_{m,t,f,s}}, \qquad \mathrm{BR}_s = \frac{\sum_{m,t,f} M_{m,t,f,s} R_{m,t,f,s}}{\sum_{m,t,f} R_{m,t,f,s}}, \]
\[ \mathrm{PP}_s = \frac{\sum_{m,t,f} M_{m,t,f,s} R_{m,t,f,s} P_{m,t,f}}{\sum_{m,t,f} M_{m,t,f,s} P_{m,t,f}}, \qquad \mathrm{PR}_s = \frac{\sum_{m,t,f} M_{m,t,f,s} R_{m,t,f,s} P_{m,t,f}}{\sum_{m,t,f} R_{m,t,f,s} P_{m,t,f}}. \]

Bin precision quantifies how much of the proposed mask corresponds to the source, and bin recall quantifies how much of the source is recovered by the mask. Power precision and recall are weighted by the audio power, and thus quantify how much of the energy passed through the mask corresponds to the source and how much of the source energy is recovered by the mask. Note that power precision is a (nonlinear) transformation of the signal-to-interference ratio. Since the true sources have no definite ordering, we assign each true source to the proposed source that attains the highest bin precision. In addition to computing these metrics for each of the methods, we also compute them for the ratio mask itself. Since MESSL was computed for each microphone pair separately, we report both the mean performance across microphone pairs and the performance on the microphone pair for which MESSL performs the best.

Figure 3 shows the results of our experiments. We see that LGAP is competitive with the other methods across all metrics.
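The four metrics above reduce to a few weighted sums over mask arrays. A minimal sketch for a single source, with hypothetical array shapes (not the paper's evaluation code):

```python
import numpy as np

def mask_metrics(M, R, P):
    """Bin/power precision and recall for one source.

    M: a method's mask, R: the ideal ratio mask, P: power spectral density;
    all arrays of shape (mics, times, freqs) for a single source s.
    """
    bp = (M * R).sum() / M.sum()            # how much of the mask is the source
    br = (M * R).sum() / R.sum()            # how much of the source is recovered
    pp = (M * R * P).sum() / (M * P).sum()  # power-weighted precision
    pr = (M * R * P).sum() / (R * P).sum()  # power-weighted recall
    return bp, br, pp, pr

# A binary mask that matches a binary ideal mask exactly scores 1 everywhere;
# a fully disjoint mask scores 0 everywhere.
R = np.zeros((1, 4, 4))
R[0, :2] = 1.0
P = np.full_like(R, 2.0)
assert mask_metrics(R, R, P) == (1.0, 1.0, 1.0, 1.0)
assert mask_metrics(1 - R, R, P) == (0.0, 0.0, 0.0, 0.0)
```

For soft (ratio-valued) masks the metrics land between 0 and 1, which is why even the ideal ratio mask itself does not score exactly 1 in Figure 3.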
Note that the independent tied-microphone method has high precision, but low recall, as its strong assumptions cause it to erroneously classify large fractions of the input as noise, whereas MESSL obtains high recall but low precision, indicating that it recovers most of each source but also admits some interference. LGAP, on the other hand, is able to maintain both high precision and high recall. In addition, and of particular importance for virtual reality applications, the LGAP model obtains reliably accurate location estimates, and consistently outperforms the other localization methods across all input conditions.

Interestingly, the independent tied-microphone baseline performs quite well, especially in small regions and with few sources. This suggests that, when propagation delays are small and there are only a few events, independent consideration of the phase shifts at each time-frequency bin is sufficient to obtain good source masks and location estimates. However, as region size grows and more sources are added, the time delays cause different sources to be active at the same time, reducing the effectiveness of this baseline method.

LGAP maintains high performance even with a small number of microphones. In addition, as the number of microphones increases, LGAP is able to combine information across microphones and improve its precision markedly. As the number of sources increases, all methods show decreased performance, but LGAP maintains relatively high precision across large numbers of sources. As the region size grows, the LGAP method is able to account for the increased propagation delays and maintain high performance.

4 Conclusion

We have described a Bayesian method for sound separation and localization that incorporates known microphone positions and smoothness assumptions to improve separation quality.
This method is robust to a variety of input conditions, and can combine information from distant microphones even in the presence of significant time delays. Apart from the smoothness assumption, the method does not depend on source statistics. It thus has the potential to be used for a variety of applications, and is particularly suited to capturing audio events that are distributed across large real-world environments.

The experiments and analysis presented here use a simple microphone configuration and synthetic dataset, and focus on how much of the sound is preserved and correctly separated by each method. It would be interesting to compare the methods using other metrics, such as BSS_EVAL, which decomposes the error into interference and artifact components [Vincent et al., 2006], or PEASS, which estimates the subjective quality of the separation using the decomposed error [Emiya et al., 2011]. It would also be relevant to evaluate the methods using different microphone geometries, types of audio, and distributions of noise. For instance, one could conduct an experiment using diffuse noise that is correlated at low frequencies to mimic more realistic noise fields [Habets and Gannot, 2007], or set up physical microphones to measure performance in a non-synthetic task. These experiments are left to future work.

There is considerable room to extend the LGAP model presented here by imposing different structural restrictions on the latent activity arrays. For instance, it would be possible to explicitly model sources with different spatial, temporal, and frequential characteristics by using distinct covariance matrices for each, or to modify the activity matrix to handle sources that move over time.
Additionally, it would be interesting to explore methods for making inference more efficient, such as by exploring simpler local smoothness priors (instead of a global Gaussian process) or discretizing the set of possible phase observations. The masks generated by LGAP could be post-processed to further improve separation, for instance by using beamforming to combine audio across all microphones. Finally, LGAP could easily be extended to work with more complex microphone geometries instead of pairs, which could improve both separation and localization performance.

References

Parham Aarabi. Self-localizing dynamic microphone arrays. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 32(4):474–484, 2002.

Deblin Bagchi, Michael I Mandel, Zhongqiu Wang, Yanzhang He, Andrew Plummer, and Eric Fosler-Lussier. Combining spectral feature mapping and multi-channel model-based source separation for noise-robust automatic speech recognition. In Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on, pages 496–503. IEEE, 2015.

Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.

Laura Dietz. Directed factor graph notation for generative models. Max Planck Institute for Informatics, Tech. Rep., 2010.

Yuval Dorfan, Dani Cherkassky, and Sharon Gannot. Speaker localization and separation using incremental distributed expectation-maximization. In Signal Processing Conference (EUSIPCO), 2015 23rd European, pages 1256–1260. IEEE, 2015.

Valentin Emiya, Emmanuel Vincent, Niklas Harlander, and Volker Hohmann. Subjective and objective quality assessment of audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2046–2057, 2011.

Emanuël AP Habets and Sharon Gannot.
Generating sensor signals in isotropic noise fields. The Journal of the Acoustical Society of America, 122(6):3464–3470, 2007.

Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent component analysis, volume 46. John Wiley & Sons, 2004.

Alexander Jourjine, Scott Rickard, and Ozgur Yilmaz. Blind separation of disjoint orthogonal signals: Demixing n sources from 2 mixtures. In Acoustics, Speech, and Signal Processing, 2000. ICASSP'00. Proceedings. 2000 IEEE International Conference on, volume 5, pages 2985–2988. IEEE, 2000.

Jerad Lewis. Microphone array beamforming. Analog Devices, AN1140, 2012.

Michael I Mandel and Nicoleta Roman. Enforcing consistency in spectral masks using Markov random fields. In Signal Processing Conference (EUSIPCO), 2015 23rd European, pages 2028–2032. IEEE, 2015.

Michael I Mandel, Ron J Weiss, and Daniel P W Ellis. Model-based expectation-maximization source separation and localization. IEEE Transactions on Audio, Speech, and Language Processing, 17(8), 2009.

Robert Oldfield, Ben Shirley, and Jens Spille. Object-based audio for interactive football broadcast. Multimedia Tools and Applications, 74(8):2717–2741, 2015.

Scott Rickard. The DUET Blind Source Separation Algorithm, pages 217–241. Springer Netherlands, Dordrecht, 2007. ISBN 978-1-4020-6479-1. doi: 10.1007/978-1-4020-6479-1_8. URL https://doi.org/10.1007/978-1-4020-6479-1_8.

Soundararajan Srinivasan, Nicoleta Roman, and DeLiang Wang. Binary and ratio time-frequency masks for robust speech recognition. Speech Communication, 48(11):1486–1501, 2006.

Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469, 2006.

Tuomas Virtanen.
Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech, and Language Processing, 15(3):1066–1074, 2007.

DeLiang Wang and Jitong Chen. Supervised speech separation based on deep learning: An overview. arXiv preprint arXiv:1708.07524, 2017.

Arie Yeredor. Blind source separation with pure delay mixtures. In Proc. ICA, pages 522–527, 2001.