{"title": "Cocktail Party Processing via Structured Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 224, "page_last": 232, "abstract": "While human listeners excel at selectively attending to a conversation in a cocktail party, machine performance is still far inferior by comparison. We show that the cocktail party problem, or the speech separation problem, can be effectively approached via structured prediction. To account for temporal dynamics in speech, we employ conditional random fields (CRFs) to classify speech dominance within each time-frequency unit for a sound mixture. To capture the complex, nonlinear relationship between input and output, both state and transition feature functions in CRFs are learned by deep neural networks. The formulation of the problem as classification allows us to directly optimize a measure that is well correlated with human speech intelligibility. The proposed system substantially outperforms existing ones in a variety of noises.", "full_text": "Cocktail Party Processing via Structured Prediction

Yuxuan Wang1, DeLiang Wang1,2

1Department of Computer Science and Engineering
2Center for Cognitive Science
The Ohio State University
Columbus, OH 43210
{wangyuxu,dwang}@cse.ohio-state.edu

Abstract

While human listeners excel at selectively attending to a conversation in a cocktail party, machine performance is still far inferior by comparison. We show that the cocktail party problem, or the speech separation problem, can be effectively approached via structured prediction. To account for temporal dynamics in speech, we employ conditional random fields (CRFs) to classify speech dominance within each time-frequency unit for a sound mixture. To capture the complex, nonlinear relationship between input and output, both state and transition feature functions in CRFs are learned by deep neural networks.
The formulation of the problem as classification allows us to directly optimize a measure that is well correlated with human speech intelligibility. The proposed system substantially outperforms existing ones in a variety of noises.

1 Introduction

The cocktail party problem, or the speech separation problem, is one of the central problems in speech processing. A particularly difficult scenario is monaural speech separation, in which mixtures are recorded by a single microphone and the task is to separate the target speech from its interference. This is a severely underdetermined figure-ground separation problem, and it has been studied for decades with limited success.

Researchers have attempted to solve the monaural speech separation problem from various angles. In signal processing, speech enhancement (e.g., [1, 2]) has been extensively studied, and assumptions regarding the statistical properties of noise are crucial to its success. Model-based methods (e.g., [3]) work well in constrained environments, but source models need to be trained in advance. Computational auditory scene analysis (CASA) [4] is inspired by how the human auditory system functions [5]. CASA has the potential to deal with general acoustic environments, but existing systems have limited performance, particularly in dealing with unvoiced speech.

Recent studies suggest a new formulation of the cocktail party problem, where the focus is to classify whether a time-frequency (T-F) unit is dominated by the target speech [6]. Motivated by this viewpoint, we propose to approach the monaural speech separation problem via structured prediction. The use of structured predictors, as opposed to binary classifiers, is motivated by the temporal dynamics of the speech signal.
Our study makes the following contributions: (1) we demonstrate that modeling temporal dynamics via structured prediction can significantly improve separation; (2) to capture nonlinearity, we propose a new structured prediction model that makes use of the discriminative feature learning power of deep neural networks; and (3) instead of classification accuracy, we show how to directly optimize a measure that is well correlated with human speech intelligibility.

2 Separation as binary classification

We aim to estimate a time-frequency matrix called the ideal binary mask (IBM). The IBM is a binary matrix constructed from the premixed target and interference, where 1 indicates that the target energy exceeds the interference energy by a local signal-to-noise ratio (SNR) criterion (LC) in the corresponding T-F unit, and 0 otherwise. The IBM is defined as

IBM(t, f) = \begin{cases} 1, & \text{if } SNR(t, f) > LC \\ 0, & \text{otherwise,} \end{cases}

where SNR(t, f) denotes the local SNR (in decibels) within the T-F unit at time t and frequency f. We adopt the common choice of LC = 0 in this paper [7]. Despite its simplicity, adopting the IBM as a computational objective offers several advantages. First, the IBM is directly based on the auditory masking phenomenon, whereby a stronger sound tends to mask a weaker one within a critical band. Second, unlike other objectives such as maximizing SNR, it is well established that IBM processing produces large human speech intelligibility improvements, even for very low SNR mixtures [7–9]. Improving human speech intelligibility is considered a gold standard for speech separation. Third, IBM estimation naturally leads to classification, which opens the cocktail party problem to a plethora of machine learning techniques.

We propose to formulate IBM estimation as binary classification, a form of supervised learning, as follows.
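To make the IBM definition concrete, here is a minimal numpy sketch; the toy energy matrices are illustrative assumptions, not values from the paper:

```python
import numpy as np

def ideal_binary_mask(target_energy, interference_energy, lc_db=0.0, eps=1e-12):
    """IBM(t, f) = 1 where the local SNR (in dB) exceeds the criterion LC, else 0.

    target_energy, interference_energy: per-T-F-unit energies of the
    premixed target and interference (e.g., channels x frames).
    """
    local_snr_db = 10.0 * np.log10((target_energy + eps) / (interference_energy + eps))
    return (local_snr_db > lc_db).astype(np.int8)

# Toy example: 2 filter channels x 3 time frames, LC = 0 dB as in the paper.
target = np.array([[4.0, 1.0, 9.0], [1.0, 16.0, 1.0]])
noise = np.array([[1.0, 4.0, 1.0], [1.0, 1.0, 25.0]])
mask = ideal_binary_mask(target, noise)  # 1 where target energy dominates
```

Weighting the mixture's cochleagram by such a mask and resynthesizing then yields the separated target.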
A sound mixture sampled at 16 kHz is passed through a 64-channel gammatone filterbank spanning 50 Hz to 8000 Hz on the equivalent rectangular bandwidth rate scale. The output from each filter channel is divided into 20-ms frames with a 10-ms frame shift, producing a cochleagram [4]. Due to the different spectral properties of speech across frequency, a subband classifier is trained for each filter channel independently, with the IBM providing training labels. Acoustic features for each subband classifier are extracted from T-F units in the cochleagram. The target speech is separated by binary weighting of the cochleagram using the estimated IBM [4].

Several recent studies have attempted to directly estimate the IBM via classification. By employing Gaussian mixture models (GMMs) as classifiers and amplitude modulation spectrograms (AMS) as features, Kim et al. [10] show that estimated masks can improve human speech intelligibility in noise. Han and Wang [11] have improved Kim et al.'s system by employing support vector machines (SVMs) as classifiers. Wang et al. [12] propose a set of complementary acoustic features that shows further improvements over previous systems. The complementary feature is a concatenation of AMS, relative spectral transform and perceptual linear prediction (RASTA-PLP), mel-frequency cepstral coefficients (MFCC), and pitch-based features.

Because the ratio of 1's to 0's in the IBM is often skewed, simply using classification accuracy as the evaluation criterion may not be appropriate. Speech intelligibility studies [9, 10] have evaluated the influence of the hit (HIT) and false-alarm (FA) rates on intelligibility scores. The difference, the HIT−FA rate, is found to be well correlated with human speech intelligibility in noise [10].
The HIT rate is the percent of correctly classified target-dominant T-F units (1's) in the IBM, and the FA rate is the percent of wrongly classified interference-dominant T-F units (0's). Therefore, it is desirable to design a separation algorithm that maximizes the HIT−FA rate of the output mask.

3 Proposed system

Dictated by speech production mechanisms, the IBM contains highly structured, rather than random, patterns. Previous systems do not explicitly model such structure. As a result, temporal dynamics, a fundamental characteristic of speech, is largely ignored in previous work. Separation systems accounting for temporal dynamics do exist. For example, Mysore et al. [13] incorporate temporal dynamics using HMMs, and Hershey et al. [14] consider different levels of dynamic constraints. However, these works do not treat separation as classification. Contrary to standard binary classifiers, structured prediction models are able to model correlations in the output. In this paper, we treat unit classification at each filter channel as a sequence labeling problem and employ linear-chain conditional random fields (CRFs) [15] as subband classifiers.

3.1 Conditional random fields

Unlike an HMM, a CRF is a discriminative model and does not need independence assumptions on features, making it more suitable to our task. A CRF models the posterior probability P(y|x) as follows. Denoting y as a label sequence and x as an input sequence,

P(y \mid x) = \frac{\exp\left( \sum_t w^T f(y, x, t) \right)}{Z(x)}.    (1)

Here t indexes time frames, w is the parameter vector to learn, and Z(x) = \sum_{y'} \exp\left( \sum_t w^T f(y', x, t) \right) is the partition function. f is a vector-valued feature function associated with each local site (a T-F unit in our task), and is often categorized into state feature functions s(y_t, x, t) and transition feature functions t(y_{t-1}, y_t, x, t). State feature functions define the local discriminant functions for each T-F unit, and transition feature functions capture the interaction between neighboring labels. We assume a linear-chain setting and the first-order Markovian property, i.e., only interactions between two neighboring units in time are modeled. In our task, we can simply use the acoustic feature vector in each T-F unit for the state feature functions and concatenations of neighboring feature vectors for the transition feature functions:

s(y_t, x, t) = [\delta(y_t = 0) x_t, \; \delta(y_t = 1) x_t]^T,    (2)

t(y_{t-1}, y_t, x, t) = [\delta(y_{t-1} = y_t) z_t, \; \delta(y_{t-1} \neq y_t) z_t]^T,    (3)

where \delta is the indicator function and z_t = [x_{t-1}, x_t]^T. Equation (3) essentially encodes temporal continuity in the IBM. To simplify notation, all feature functions are written as f(y_{t-1}, y_t, x, t) in the remainder of the paper.

Training estimates w, and is usually done by maximizing the conditional log-likelihood on a training set T = \{(x^{(m)}, y^{(m)})\}, i.e., we seek w by

\max_w \sum_m \log p(y^{(m)} \mid x^{(m)}, w) + R(w),    (4)

where m is the index of a training sample, and R(w) is a regularizer of w (we use \ell_2 in this paper). For gradient ascent, a popular choice is the limited-memory BFGS (L-BFGS) algorithm [16].

3.2 Nonlinear expansion using deep neural networks

A CRF is a log-linear model, which has only linear modeling power. As acoustic features are generally not linearly separable, the direct use of CRFs is unlikely to produce good results. In the following, we propose a method to transform the standard CRF into a nonlinear sequence classifier.

We employ pretrained deep neural networks (DNNs) to capture the nonlinearity between input and output. DNNs have received widespread attention since Hinton et al.'s paper [17].
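A small numpy sketch of the state and transition feature functions in equations (2) and (3); the 2-dimensional unit features below are illustrative:

```python
import numpy as np

def state_features(y_t, x_t):
    """Eq. (2): place x_t in the slot selected by the label y_t in {0, 1}."""
    d = len(x_t)
    s = np.zeros(2 * d)
    s[y_t * d:(y_t + 1) * d] = x_t
    return s

def transition_features(y_prev, y_t, x_prev, x_t):
    """Eq. (3): z_t = [x_{t-1}, x_t] fills the 'same label' slot when
    y_{t-1} == y_t and the 'label change' slot otherwise."""
    z_t = np.concatenate([x_prev, x_t])
    f = np.zeros(2 * len(z_t))
    slot = 0 if y_prev == y_t else 1
    f[slot * len(z_t):(slot + 1) * len(z_t)] = z_t
    return f

x_prev, x_t = np.array([1.0, 2.0]), np.array([3.0, 4.0])
s = state_features(1, x_t)                   # x_t lands in the y_t = 1 slot
tr = transition_features(0, 1, x_prev, x_t)  # label change: second slot
```

The score w^T f in equation (1) is then just a dot product of the learned weights with these sparse vectors.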
DNNs can be viewed as hierarchical feature detectors that learn increasingly complex feature mappings as the number of hidden layers increases. To deal with problems such as vanishing gradients, Hinton et al. suggest first pretraining a DNN using a stack of restricted Boltzmann machines (RBMs) in an unsupervised, layerwise fashion. The resulting network weights are then fine-tuned with supervision by backpropagation.

We first train a DNN in the standard way to classify speech dominance in each T-F unit. After pretraining and supervised fine-tuning, we take the last hidden layer representations from the DNN as learned features to train the CRF. In a discriminatively trained DNN, the weights from the last hidden layer to the output layer define a linear classifier, hence the last hidden layer representations are more amenable to linear classification. In other words, we replace x by h in equations (1)-(4), where h represents the learned hidden features. This way, CRFs greatly benefit from the nonlinear modeling power of deep architectures.

To better encode local contextual information, we could use a window (across both time and frequency) of learned features to label the current T-F unit. A more parsimonious way is to use a window of posteriors estimated by DNNs as features to train the CRF, which can dramatically reduce the dimensionality. We note in passing that correlations across both time and frequency can also be encoded at the model level, e.g., by using grid-structured CRFs. However, the decoding algorithm may substantially increase the computational complexity of the system.

We want to point out that an important advantage of using neural networks for feature learning is their efficiency in the test phase; once trained, the nonlinear feature extraction of a DNN is extremely fast (it only involves a forward pass). This is, however, not always true for other methods.
For example, sparse coding may need to solve a new optimization problem to obtain the features. Test phase efficiency is crucial for real-time implementation of a speech separation system.

There is related work on developing nonlinear sequence classifiers in the machine learning community. For example, van der Maaten et al. [18] and Morency et al. [19] consider incorporating hidden variables into the training and inference of CRFs. Peng et al. [20] investigate a combination of neural networks and CRFs. Other related studies include [21] and [22]. The proposed model differs from the previous methods in that (1) a discriminatively trained deep architecture is used, and/or (2) a CRF instead of a Viterbi decoder is used on top of a neural network for sequence labeling, and/or (3) nonlinear features are also used in modeling transitions. In addition, the use of a contextual window and the change of the objective function discussed in the next subsection are specifically tailored to the speech separation problem.

3.3 Maximizing the HIT−FA rate

As argued before, it is desirable to train a classifier to maximize the HIT−FA rate of the estimated mask. In this subsection, we show how to change the objective function and efficiently calculate the gradients in the CRF. Since subband classifiers are used, we aim to maximize the channelwise HIT−FA.

Denote the output label as u_t \in \{0, 1\} and the true label as y_t \in \{0, 1\}. The per-utterance HIT−FA rate can be expressed as

\frac{\sum_t u_t y_t}{\sum_t y_t} - \frac{\sum_t u_t (1 - y_t)}{\sum_t (1 - y_t)},

where the first term is the HIT rate and the second the FA rate.
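The per-utterance HIT−FA above is a two-line computation over binary masks; a numpy sketch (the toy labels are illustrative):

```python
import numpy as np

def hit_minus_fa(u, y):
    """HIT - FA for predicted labels u against IBM labels y (0/1 arrays).

    HIT: fraction of target-dominant units (y == 1) labeled 1.
    FA:  fraction of interference-dominant units (y == 0) labeled 1.
    """
    u, y = np.asarray(u, dtype=float), np.asarray(y, dtype=float)
    hit = np.sum(u * y) / np.sum(y)
    fa = np.sum(u * (1.0 - y)) / np.sum(1.0 - y)
    return hit - fa

y = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # IBM labels
u = np.array([1, 1, 1, 0, 1, 0, 0, 0])   # predictions: HIT = 3/4, FA = 1/4
score = hit_minus_fa(u, y)               # -> 0.5
```

A perfect mask scores 1, and an all-0 mask scores 0, which is why accuracy-driven training on skewed channels is penalized by this measure.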
To make the objective function differentiable, we replace u_t by the marginal probability p(y_t = 1 | x); hence we seek w by maximizing the HIT−FA on a training set:

\max_w \left( \frac{\sum_m \sum_t p(y_t^{(m)} = 1 \mid x^{(m)}, w) \, y_t^{(m)}}{\sum_m \sum_t y_t^{(m)}} - \frac{\sum_m \sum_t p(y_t^{(m)} = 1 \mid x^{(m)}, w) \, (1 - y_t^{(m)})}{\sum_m \sum_t (1 - y_t^{(m)})} \right).    (5)

Clearly, computing the gradient of (5) boils down to computing the gradient of the marginal. A speech utterance (sentence) typically spans several hundred time frames, therefore numerical stability is critically important in our task. As can be seen later, computing the gradient of the marginal requires the gradient of the forward/backward scores. We adopt Rabiner's scaling trick [23] used in HMMs to normalize the forward/backward score at each time point. Specifically, define \alpha(t, u) and \beta(t, u) as the forward and backward score of label u at time t, respectively. We normalize the forward score such that \sum_u \alpha(t, u) = 1, and use the resulting scaling to normalize the backward score. Defining the potential function \phi_t(v, u) = \exp\left( w^T f(v, u, x, t) \right), the recurrence of the normalized forward/backward score is written as

\alpha(t, u) = \sum_v \alpha(t - 1, v) \, \phi_t(v, u) / s(t),    (6)

\beta(t, u) = \sum_v \beta(t + 1, v) \, \phi_{t+1}(u, v) / s(t + 1),    (7)

where s(t) = \sum_u \sum_v \alpha(t - 1, v) \phi_t(v, u). It is easy to show that Z(x) = \prod_t s(t), and now the marginal has the simpler form p(y_t | x, w) = \alpha(t, y_t) \beta(t, y_t). Therefore, the gradient of the marginal is

\frac{\partial p(y_t \mid x, w)}{\partial w} = G_\alpha(t, y_t) \beta(t, y_t) + \alpha(t, y_t) G_\beta(t, y_t),    (8)

where G_\alpha and G_\beta are the gradients of the normalized forward and backward scores, respectively. Due to score normalization, G_\alpha and G_\beta will very unlikely overflow. We now show that G_\alpha can be calculated recursively.
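As an aside, the scaled recursions (6) and (7) are easy to sketch for the binary label set; the random potentials below stand in for \phi_t(v, u) = \exp(w^T f(v, u, x, t)), and the uniform initial forward score is an assumption of this toy example:

```python
import numpy as np

def scaled_forward_backward(phi, alpha0):
    """Scaled forward-backward for a linear-chain model with binary labels.

    phi[t] is the 2x2 potential matrix connecting labels at frames t and
    t + 1 (in the paper, exp(w^T f(v, u, x, t))). alpha0 is the normalized
    forward score at the first frame. Returns alpha, beta and the scales s;
    the marginal is then p(y_t = u | x) = alpha[t, u] * beta[t, u].
    """
    T = len(phi) + 1
    alpha = np.zeros((T, 2))
    beta = np.ones((T, 2))
    s = np.ones(T)
    alpha[0] = alpha0
    for t in range(1, T):              # eq. (6): normalize so each row sums to 1
        a = alpha[t - 1] @ phi[t - 1]
        s[t] = a.sum()
        alpha[t] = a / s[t]
    for t in range(T - 2, -1, -1):     # eq. (7): reuse the forward scales
        beta[t] = (phi[t] @ beta[t + 1]) / s[t + 1]
    return alpha, beta, s

rng = np.random.default_rng(1)
phi = np.exp(rng.standard_normal((9, 2, 2)))   # 10 frames, toy potentials
alpha, beta, s = scaled_forward_backward(phi, np.array([0.5, 0.5]))
marginals = alpha * beta                       # each row sums to 1
```

Because every score is renormalized by s(t), neither pass over- or underflows even for utterances spanning hundreds of frames, and the partition function is recovered as the product of the scales.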
Let q(t, u) = \sum_v \alpha(t - 1, v) \phi_t(v, u); we have

G_\alpha(t, u) = \frac{\partial \alpha(t, u)}{\partial w} = \frac{\frac{\partial q(t, u)}{\partial w} \sum_v q(t, v) - q(t, u) \sum_v \frac{\partial q(t, v)}{\partial w}}{\left( \sum_v q(t, v) \right)^2},    (9)

and

\frac{\partial q(t, u)}{\partial w} = \sum_v G_\alpha(t - 1, v) \phi_t(v, u) + \sum_v \alpha(t - 1, v) \phi_t(v, u) f(v, u, x, t).    (10)

[Figure 1: HIT−FA results. (a)-(c): matched-noise test condition; (d)-(f): unmatched-noise test condition. Panels show overall, voiced, and unvoiced results, comparing DNN, DNN*, DNN-CRF, and DNN-CRF* at −10, −5, and 0 dB.]

[Figure 2: Channelwise HIT−FA comparisons on the 0 dB test mixtures. (a) overall; (b) voiced speech intervals; (c) unvoiced speech intervals.]

The derivation of G_\beta is similar and thus omitted. The time complexity of calculating G_\alpha and G_\beta is O(L|S|^2), where L and |S| are the utterance length and the size of the label set, respectively. This is the same as the forward-backward recursion.

The objective function in (5) is not concave. Since high accuracy correlates with high HIT−FA, a safe practice is to use a solution from (4) as a warm start for the subsequent optimization of (5). For feature learning, the DNN is also trained using (5) in the final system. The gradient calculation is much simpler due to the absence of transition features. We found that L-BFGS performs well and shows fast and stable convergence for both feature learning and CRF training.

4 Experimental results

4.1 Experimental setup

Our training and test sets are primarily created from the IEEE corpus [24] recorded by a single female speaker. This enables us to directly compare with previous intelligibility studies [10], where the same speaker is used in training and testing. The training set is created by mixing 50 utterances with 12 noises at 0 dB. To create the test set, we choose 20 unseen utterances from the same speaker.
First, the 20 utterances are mixed with the previous 12 noises to create a matched-noise test condition, and then with 5 unseen noises to create an unmatched-noise test condition. The test noises¹ cover a variety of daily noises and most of them are highly non-stationary. In each frequency channel, there are roughly 150,000 and 82,000 T-F units in the training and test sets, respectively. Speaker-independent experiments are presented in Section 4.4.

[Figure 3: Masks for a test utterance mixed with an unseen crowd noise at 0 dB. White represents 1's and black represents 0's. (a) Ideal binary mask; (b) DNN-CRF*-P mask; (c) DNN mask.]

The proposed system is called DNN-CRF, or DNN-CRF* if it is trained to maximize HIT−FA. We use the suffixes R and P to distinguish the training features for the CRF: R stands for learned features without a context window (features are learned from the complementary acoustic feature set mentioned in Section 2), and P stands for a window of posterior features. We use a two-hidden-layer DNN, as it provides a good trade-off between performance and complexity, and use a context window spanning 5 time frames and 17 frequency channels to construct the posterior feature vector. We use the cross-entropy objective function for training the standard DNN in comparisons.

4.2 Experiment 1: HIT−FA maximization

In this subsection, we show the effect of directly maximizing the HIT−FA rate.
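For the P variant, the posterior feature vector for one T-F unit can be sketched as cropping a 17-channel by 5-frame window out of the DNN posterior map; the edge padding for border units is our implementation assumption, not specified in the paper:

```python
import numpy as np

def posterior_window(post, c, t, n_chan=17, n_frames=5):
    """Flatten the n_chan x n_frames window of posteriors centered on unit
    (c, t) into one feature vector for the CRF. post: channels x frames
    array of DNN posteriors in [0, 1]; borders are edge-padded (an assumption)."""
    pc, pt = n_chan // 2, n_frames // 2
    padded = np.pad(post, ((pc, pc), (pt, pt)), mode="edge")
    return padded[c:c + n_chan, t:t + n_frames].ravel()

post = np.random.default_rng(2).random((64, 200))  # 64 channels, 200 frames
feat = posterior_window(post, c=0, t=0)            # 17 * 5 = 85 dimensions
```

This keeps the CRF input at 85 dimensions per unit, far smaller than a comparable window of raw learned features.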
To evaluate the contribution from the change of the objective alone, we use ideal pitch in the following experiments to neutralize pitch estimation errors. The models are trained on 0 dB mixtures. In addition to 0 dB, we also test the trained models on −10 and −5 dB mixtures. Such a test setting not only allows us to measure the system's generalization to different SNR conditions, but also shows the effects of HIT−FA maximization on estimating sparse IBMs. We compare DNN-CRF*-R with DNN, DNN* and DNN-CRF-R, and the results are shown in Figures 1 and 2.

We document HIT−FA rates on three levels: overall, voiced intervals (pitched frames) and unvoiced intervals (unpitched frames). Voicing boundaries are determined using ideal pitch. Figure 1 shows the results for both matched-noise and unmatched-noise test conditions. First, comparing the performances of DNN-CRFs and DNNs, we can see that modeling temporal continuity always improves performance. It also appears very helpful for generalization to different SNRs. In the matched condition, the improvement from directly maximizing HIT−FA is most significant in unvoiced intervals. The improvement becomes larger as SNR decreases. In the unmatched condition, as classification becomes much harder, direct maximization of HIT−FA offers more improvement in all cases. The largest HIT−FA improvement of DNN-CRF*-R over DNN is about 10.7% and 21.2% absolute in overall and unvoiced speech intervals, respectively. For a closer inspection, Figure 2 shows channelwise HIT−FA comparisons on the 0 dB test mixtures in the matched-noise test condition.
It is well known that unvoiced speech is indispensable for speech intelligibility but hard to separate. Due to the lack of harmonicity and weak energy, frequency channels containing unvoiced speech often have significantly skewed distributions of target-dominant and interference-dominant units. Therefore, an accuracy-maximizing classifier tends to output all 0's to attain a high accuracy. As an illustration, Figure 3 shows two masks for an utterance mixed with an unseen crowd noise at 0 dB, using DNN and DNN-CRF*-P respectively. The two estimated masks achieve similar accuracy, around 90%. However, it is clear that the DNN mask misses significant portions of unvoiced speech, e.g., between frames 30-50 and 220-240.

¹Test noises are: babble, bird chirp, crow, cocktail party, yelling, clap, rain, rock music, siren, telephone, white, wind, crowd, fan, speech shaped, traffic, and factory noise. The first 12 are used in training.

Table 1: Performance comparisons between different systems. Boldface indicates best result.

                      | Matched-noise condition               | Unmatched-noise condition
System                | Accuracy HIT−FA SNR (dB) SegSNR (dB)  | Accuracy HIT−FA SNR (dB) SegSNR (dB)
GMM [10]              | 77.4%    55.4%  10.2     7.3          | 65.9%    31.6%  6.8      1.9
SVM [11]              | 86.6%    68.0%  10.5     10.9         | 91.2%    64.1%  9.7      7.9
DNN                   | 87.7%    71.6%  11.4     11.8         | 91.1%    66.2%  9.9      8.1
CRF                   | 82.3%    59.8%  8.8      8.7          | 90.8%    64.0%  9.3      7.8
SVM-Struct            | 81.7%    58.6%  8.4      8.1          | 90.7%    63.5%  9.1      7.5
CNF                   | 87.8%    71.7%  11.2     12.0         | 91.1%    66.9%  9.8      8.4
LD-CRF                | 86.3%    68.4%  9.7      10.5         | 91.1%    63.6%  8.9      7.8
DNN-CRF*-R            | 89.1%    75.6%  12.1     13.2         | 90.8%    70.2%  10.3     9.0
DNN-CRF*-P            | 89.9%    76.9%  12.0     13.5         | 91.1%    70.7%  10.0     8.9
Hendriks et al. [1]   | n/a      n/a    4.6      0.5          | n/a      n/a    6.2      1.1
Wiener Filter [2]     | n/a      n/a    3.7      -0.7         | n/a      n/a    5.6      -0.6

Table 2: Performance comparisons when tested on different unseen speakers.

                      | Matched-noise condition               | Unmatched-noise condition
System                | Accuracy HIT−FA SNR (dB) SegSNR (dB)  | Accuracy HIT−FA SNR (dB) SegSNR (dB)
SVM [11]              | 86.2%    65.0%  10.2     9.9          | 91.1%    60.6%  9.4      7.3
DNN-CRF*-P            | 87.3%    72.0%  12.1     11.2         | 90.9%    68.3%  10.1     8.1
Hendriks et al. [1]   | n/a      n/a    4.5      -2.9         | n/a      n/a    6.9      -1.0
Wiener Filter [2]     | n/a      n/a    3.8      -4.5         | n/a      n/a    6.0      -3.3

In summary, direct maximization of HIT−FA improves HIT−FA performance compared to accuracy maximization, especially for unvoiced speech, and the improvement is more significant when the system is tested on unseen acoustic environments.

4.3 Experiment 2: system comparisons

We systematically compare the proposed system with three kinds of systems on 0 dB mixtures: binary classifier based, structured predictor based, and speech enhancement based. In addition to HIT−FA, we also include classification accuracy, SNR, and segmental SNR (SegSNR) as alternative evaluation criteria. To compute SNRs, we use the target speech resynthesized from the IBM as the ground-truth signal for all classification-based systems. This way of computing SNRs is commonly adopted in the literature [4, 25], as the IBM represents the ground truth of classification. All classification-based systems use the same feature set described in Section 2, but with estimated pitch, except for Kim et al.'s GMM-based system, which uses AMS features [10]. Note that we fail to produce reasonable results using the complementary feature set in Kim et al.'s system, possibly because GMMs require much more training data than discriminative models for high-dimensional features. Results are summarized in Table 1.

We first compare with methods based on binary classifiers.
These include two existing systems [10, 11] and a DNN-based system. Due to the variety of noises, classification is challenging even in the matched-noise condition. It is clear that the proposed system significantly outperforms the others in terms of all criteria. The improvement of DNN-CRF*s over DNN demonstrates the benefit of modeling temporal continuity. It is interesting to see that DNN significantly outperforms SVM, especially for unvoiced speech (not shown), which is important for speech intelligibility. We note that without RBM pretraining, DNN performs significantly worse. Classification in the unmatched-noise condition is obviously more difficult, as feature distributions are likely mismatched between the training and the test set. Kim et al.'s system fails to generalize to different acoustic environments due to substantially increased FA rates. The proposed system significantly outperforms SVM and DNN, achieving about 71% overall HIT−FA and 10 dB SNR for unseen noises. Kim et al.'s system has been shown to improve human speech intelligibility [10]; it is therefore reasonable to project that the proposed system will provide further speech intelligibility improvements.

We next compare with systems based on structured predictors, including CRF, SVM-Struct [26], conditional neural fields (CNF) [20] and latent-dynamic CRF (LD-CRF) [19]. For fair comparison, we use a two-hidden-layer CNF model with the same number of parameters as DNN-CRF*s. Conventional structured predictors such as CRF and SVM-Struct (linear kernel) are able to explicitly model temporal dynamics, but only with linear modeling capability. Direct use of CRF turns out to be much worse than using kernel SVM. Nevertheless, the performance can be substantially boosted by adding latent variables (LD-CRF) or by using nonlinear feature functions (CNF and DNN-CRF*s).
With the same network architecture, CNF mainly differs from our model in two aspects. First, CNF does not use unsupervised RBM pretraining. Second, CNF only uses bias units in building transition features. As a result, the proposed system significantly outperforms CNF, even though the CRF and neural networks are jointly trained in the CNF model. With its better ability to encode contextual information, using a window of posteriors as features clearly outperforms single-unit features in terms of classification. It is worth noting that although SVM achieves slightly higher accuracy in the unmatched-noise condition, the resulting HIT−FA and SNRs are worse than those of some other systems. This is consistent with our analysis in Section 4.2.

Finally, we compare with two representative speech enhancement systems [1, 2]. The algorithm proposed in [1] represents a recent state-of-the-art method, and Wiener filtering [2] is one of the most widely used speech enhancement algorithms. Since speech enhancement does not aim to estimate the IBM, we compare SNRs by using clean speech (not the IBM) as the ground truth. As shown in Table 1, the speech enhancement algorithms perform much worse, and this is true for all 17 noises.

Due to temporal continuity modeling and the use of T-F context, the proposed system produces masks that are smoother than those from the other systems (e.g., Figure 3). As a result, the outputs seem to contain less musical noise.

4.4 Experiment 3: speaker generalization

Although the training set contains only a single IEEE speaker, the proposed system generalizes reasonably well to different unseen speakers. To show this, we create a new test set by mixing 20 utterances from the TIMIT corpus [27] with the test noises at 0 dB. The new test utterances are chosen from 10 different female TIMIT speakers, each providing 2 utterances.
We show the results in Table 2, and it is clear that the proposed system generalizes better than existing ones to unseen speakers. Note that significantly better performance, and generalization to different genders, can be obtained by including the speaker(s) of interest in the training set.

5 Discussion and conclusion

Listening tests have shown that a high FA rate is more detrimental to speech intelligibility than a high miss rate (i.e., a low HIT rate) [9]. The proposed classification framework affords us control over these two quantities. For example, we could constrain the upper bound of the FA rate while still maximizing the HIT rate. In this case, a constrained optimization would substitute for (5). Our experimental results (not shown due to lack of space) indicate that this can effectively remove spurious target segments while still producing intelligible speech.

Being able to efficiently compute the derivative of marginals, one could in principle optimize a class of objectives other than HIT−FA. These may include objectives concerning either speech intelligibility or quality, as long as the objective of interest can be expressed or approximated by a combination of marginal probabilities. For example, we have tried to simultaneously minimize two traditional CASA measures, P_EL and P_NR (see, e.g., [25]), where P_EL represents the percent of target energy loss and P_NR the percent of noise energy residue. Significant reductions in both measures can be achieved compared to methods that maximize accuracy or conditional log-likelihood.

We have demonstrated that the challenge of the monaural speech separation problem can be effectively approached via structured prediction. Observing that the IBM exhibits highly structured patterns, we have proposed to use CRFs to explicitly model the temporal continuity in the IBM.
This linear sequence classifier is further transformed into a nonlinear one by using state and transition feature functions learned by DNNs. Consistent with findings from speech perception, we train the proposed DNN-CRF model to maximize a measure that is well correlated with human speech intelligibility in noise. Experimental results show that the proposed system significantly outperforms existing ones and generalizes better to different acoustic environments. Aside from temporal continuity, other ASA principles [5] such as common onset and co-modulation also contribute to the structure in the IBM, and we will investigate these in future work.

Acknowledgements. This research was supported in part by an AFOSR grant (FA9550-12-1-0130), an STTR subcontract from Kuzer, and the Ohio Supercomputer Center.

References

[1] R. Hendriks, R. Heusdens, and J. Jensen, "MMSE based noise PSD tracking with low complexity," in ICASSP, 2010.

[2] P. Scalart and J. Filho, "Speech enhancement based on a priori signal to noise estimation," in ICASSP, 1996.

[3] S. Roweis, "One microphone source separation," in NIPS, 2001.

[4] D. Wang and G. Brown, Eds., Computational Auditory Scene Analysis: Principles, Algorithms and Applications. Hoboken, NJ: Wiley-IEEE Press, 2006.

[5] A. S. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge, MA: MIT Press, 1994.

[6] D. Wang, "On ideal binary mask as the computational goal of auditory scene analysis," in Speech Separation by Humans and Machines, P. Divenyi, Ed. Norwell, MA: Kluwer Academic, 2005, pp. 181–197.

[7] D. Brungart, P. Chang, B. Simpson, and D. Wang, "Isolating the energetic component of speech-on-speech masking with ideal time-frequency segregation," J. Acoust. Soc. Am., vol. 120, pp. 4007–4018, 2006.

[8] M. Anzalone, L. Calandruccio, K. Doherty, and L. Carney, "Determination of the potential benefit of time-frequency gain manipulation," Ear and Hearing, vol. 27, no. 5, pp. 480–492, 2006.

[9] N. Li and P. Loizou, "Factors influencing intelligibility of ideal binary-masked speech: Implications for noise reduction," J. Acoust. Soc. Am., vol. 123, no. 3, pp. 1673–1682, 2008.

[10] G. Kim, Y. Lu, Y. Hu, and P. Loizou, "An algorithm that improves speech intelligibility in noise for normal-hearing listeners," J. Acoust. Soc. Am., vol. 126, pp. 1486–1494, 2009.

[11] K. Han and D. Wang, "An SVM based classification approach to speech separation," in ICASSP, 2011.

[12] Y. Wang, K. Han, and D. Wang, "Exploring monaural features for classification-based speech segregation," IEEE Trans. Audio, Speech, Lang. Process., in press, 2012.

[13] G. Mysore and P. Smaragdis, "A non-negative approach to semi-supervised separation of speech from noise with the use of temporal dynamics," in ICASSP, 2011.

[14] J. Hershey, T. Kristjansson, S. Rennie, and P. Olsen, "Single channel speech separation using factorial dynamics," in NIPS, 2007.

[15] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in ICML, 2001.

[16] J. Nocedal and S. Wright, Numerical Optimization. New York: Springer-Verlag, 1999.

[17] G. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.

[18] L. van der Maaten, M. Welling, and L. Saul, "Hidden-unit conditional random fields," in AISTATS, 2011.

[19] L. Morency, A. Quattoni, and T. Darrell, "Latent-dynamic discriminative models for continuous gesture recognition," in CVPR, 2007.

[20] J. Peng, L. Bo, and J. Xu, "Conditional neural fields," in NIPS, 2009.

[21] A. Mohamed, G. Dahl, and G. Hinton, "Deep belief networks for phone recognition," in NIPS Workshop on Speech Recognition and Related Applications, 2009.

[22] T. Do and T. Artieres, "Neural conditional random fields," in AISTATS, 2010.

[23] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[24] IEEE, "IEEE recommended practice for speech quality measurements," IEEE Trans. Audio Electroacoust., vol. 17, pp. 225–246, 1969.

[25] G. Hu and D. Wang, "Monaural speech segregation based on pitch tracking and amplitude modulation," IEEE Trans. Neural Networks, vol. 15, no. 5, pp. 1135–1150, 2004.

[26] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, "Support vector machine learning for interdependent and structured output spaces," in ICML, 2004.

[27] J. Garofolo, DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus, NIST, 1993.