{"title": "Probabilistic Inference of Speech Signals from Phaseless Spectrograms", "book": "Advances in Neural Information Processing Systems", "page_first": 1393, "page_last": 1400, "abstract": "", "full_text": "Probabilistic Inference of Speech Signals from\n\nPhaseless Spectrograms\n\nKannan Achan, Sam T. Roweis, Brendan J. Frey\n\nMachine Learning Group\n\nUniversity of Toronto\n\nAbstract\n\nMany techniques for complex speech processing such as denoising and\ndeconvolution, time/frequency warping, multiple speaker separation, and\nmultiple microphone analysis operate on sequences of short-time power\nspectra (spectrograms), a representation which is often well-suited to\nthese tasks. However, a signi\ufb01cant problem with algorithms that manipu-\nlate spectrograms is that the output spectrogram does not include a phase\ncomponent, which is needed to create a time-domain signal that has good\nperceptual quality. Here we describe a generative model of time-domain\nspeech signals and their spectrograms, and show how an ef\ufb01cient opti-\nmizer can be used to \ufb01nd the maximum a posteriori speech signal, given\nthe spectrogram. In contrast to techniques that alternate between esti-\nmating the phase and a spectrally-consistent signal, our technique di-\nrectly infers the speech signal, thus jointly optimizing the phase and a\nspectrally-consistent signal. We compare our technique with a standard\nmethod using signal-to-noise ratios, but we also provide audio \ufb01les on\nthe web for the purpose of demonstrating the improvement in perceptual\nquality that our technique offers.\n\n1 Introduction\n\nWorking with a time-frequency representation of speech can have many advantages over\nprocessing the raw amplitude samples of the signal directly. Much of the structure in\nspeech and other audio signals manifests itself through simultaneous common onset, off-\nset or co-modulation of energy in multiple frequency bands, as harmonics or as coloured\nnoise bursts. 
Furthermore, there are many important high-level operations which are much easier to perform in a short-time multiband spectral representation than on the time-domain signal. For example, time-scale modification algorithms attempt to lengthen or shorten a signal without affecting its frequency content. The main idea is to upsample or downsample the spectrogram of the signal along the time axis while leaving the frequency axis unwarped. Source separation or denoising algorithms often work by identifying certain time-frequency regions as having high signal-to-noise ratio or as belonging to the source of interest and \u201cmasking out\u201d others. This masking operation is very natural in the time-frequency domain. Of course, there are many clever and efficient speech processing algorithms for pitch tracking [6], denoising [7], and even timescale modification [4] that do operate directly on the signal samples, but the spectral domain certainly has its advantages.\n\n[Figure 1 appears here: samples s_1, ..., s_N grouped into overlapping windows (s_1..s_n, s_{n/2+1}..s_{3n/2}, s_{n+1}..s_{2n}, ...) feeding spectra M_1, M_2, M_3.]\n\nFigure 1: In the generative model, the spectrogram is obtained by taking overlapping windows of length n from the time-domain speech signal, and computing the energy spectrum.\n\nIn order to reap the benefits of working with a spectrogram of the audio, it is often important to \u201cinvert\u201d the spectral representation back into a time-domain signal which is consistent with the new time-frequency representation we obtain after processing. For example, we may mask out certain cells in the spectrogram after determining that they represent energy from noise signals, or we may drop columns of the spectrogram to modify the timescale. How do we recover the denoised or sped-up speech signal? 
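Before turning to the inversion itself, the forward operations described above (computing a magnitude spectrogram, masking cells, and dropping columns for time-scale modification) can be sketched in a few lines. This is only an illustrative sketch: the window length, hop, and masking threshold below are our toy choices, not the paper's settings.

```python
import numpy as np

def magnitude_spectrogram(s, n=8):
    """Overlapping windows of length n spaced n//2 apart -> magnitude spectra."""
    hop = n // 2
    frames = np.array([s[k:k + n] for k in range(0, len(s) - n + 1, hop)])
    return np.abs(np.fft.rfft(frames, axis=1))  # one row (frame) per window

rng = np.random.default_rng(0)
s = rng.standard_normal(64)
M = magnitude_spectrogram(s, n=8)   # 15 frames x 5 frequency bins

# Time-scale modification: discard every second frame to speed up by 2x.
M_fast = M[::2]

# Denoising-style masking: keep only cells above a (hypothetical) threshold.
M_masked = np.where(M > M.mean(), M, 0.0)
```

Both `M_fast` and `M_masked` are phaseless spectrograms of exactly the kind the rest of the paper inverts.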
In this paper we study this inversion and present an efficient algorithm for recovering signals from their overlapping short-time spectral magnitudes using maximum a posteriori inference in a simple probability model. This is essentially a problem of phase recovery, although with the important constraint that overlapping analysis windows must agree with each other about their estimates of the underlying waveform. The standard approach, exemplified by the classic paper of Griffin and Lim [1], is to alternate between estimating the time-domain signal given a current estimate of the phase and the observed spectrogram, and estimating the phase given the hypothesized signal and the observed spectrogram. Unfortunately, at any iteration, this technique maintains inconsistent estimates of the signal and the phase.\n\nOur algorithm maximizes the a posteriori probability of the estimated speech signal by adjusting the estimated signal samples directly, thus avoiding inconsistent phase estimates. At each step of iterative optimization, the method is guaranteed to reduce the discrepancy between the observed spectrogram and the spectrogram of the estimated waveform. Further, by jointly optimizing all samples simultaneously, the method can make global changes in the waveform, so as to better match all short-time spectral magnitudes.\n\n2 A Generative Model of Speech Signals and Spectrograms\n\nAn advantage of viewing phase recovery as a problem of probabilistic inference of the speech signal is that a prior distribution over time-domain speech signals can be used to improve performance. For example, if the identity of the speaker that produced the spectrogram is known, a speaker-specific speech model can be used to obtain a higher-quality reconstruction of the time-domain signal. 
However, it is important to point out that when prior knowledge of the speaker is not available, our technique works well using a uniform prior.\n\nFor a time-domain signal with N samples, let s be a column vector containing samples s_1, ..., s_N. We define the spectrogram of a signal as the magnitude of its windowed short-time Fourier transform. Let M = {m_1, m_2, m_3, ...} denote the spectrogram of s; m_k is the magnitude spectrum of the kth window and m_k^f is the magnitude of the fth frequency component. Further, let n be the width of the window used to obtain the short-time transform. We assume the windows are spaced at intervals of n/2, although this assumption is easy to relax. In this setup, shown in Fig. 1, a particular time-domain sample s_t contributes to exactly two windows in the spectrogram.\n\nThe joint distribution over the speech signal s and the spectrogram M is\n\nP(s, M) = P(s) P(M|s).   (1)\n\nWe use an Rth-order autoregressive model for the prior distribution over time-domain speech signals:\n\nP(s) ∝ ∏_{t=1}^{N} exp{ −(1/2ρ²) (∑_{r=1}^{R} a_r s_{t−r} − s_t)² }.   (2)\n\nIn this model, each sample is predicted to be a linear combination of the R previous samples. The autoregressive model can be estimated beforehand, using training data for a specific speaker or a general class of speakers. Although this model is overly simple for general speech signals, it is useful for avoiding discontinuities introduced at window boundaries by mis-matched phase components in neighboring frames. 
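As a concrete illustration of the prior in (2), the coefficients a_1, ..., a_R can be estimated from training audio and the unnormalized log prior of a candidate signal evaluated directly. The function names and the use of plain least squares are our choices for this sketch, not details taken from the paper.

```python
import numpy as np

def fit_ar(train, R=12):
    """Least-squares estimate of AR coefficients a_1..a_R from training audio."""
    X = np.column_stack([train[R - r:len(train) - r] for r in range(1, R + 1)])
    y = train[R:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

def ar_log_prior(s, a, rho2=1.0):
    """Unnormalized log P(s) from (2): -(1/2 rho^2) sum_t (sum_r a_r s_{t-r} - s_t)^2."""
    R = len(a)
    pred = sum(a[r] * s[R - r - 1:len(s) - r - 1] for r in range(R))
    return -np.sum((pred - s[R:]) ** 2) / (2 * rho2)

# Synthetic AR(2) data: the fitted coefficients should be close to (0.5, -0.3).
rng = np.random.default_rng(0)
e = rng.standard_normal(5000)
s = np.zeros(5000)
for t in range(2, 5000):
    s[t] = 0.5 * s[t - 1] - 0.3 * s[t - 2] + e[t]
a = fit_ar(s, R=2)
```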
To avoid artifacts at frame boundaries, the variance of the prior can be set to low values at frame boundaries, enabling the prior to \u201cpave over\u201d the artifacts.\n\nAssuming that the observed spectrogram is equal to the spectrogram of the hidden speech signal, plus independent Gaussian noise, the likelihood can be written\n\nP(M|s) ∝ ∏_k exp{ −(1/2σ²) ||m̂_k(s) − m_k||² }   (3)\n\nwhere σ² is the variance of the noise in the observed spectra, and m̂_k(s) is the magnitude spectrum given by the appropriate window of the estimated speech signal, s. Note that the magnitude spectra are independent given the time-domain signal.\n\nThe likelihood in (3) favors configurations of s that match the observed spectrogram, while the prior in (2) places more weight on configurations that match the autoregressive model.\n\n2.1 Making the speech signal explicit in the model\n\nWe can simplify the functional form of m̂_k(s) by introducing the n×n Fourier transform matrix, F. Let s_k be an n-vector containing the samples from the kth window. Using the fact that the squared magnitude of a complex number c is cc*, where * denotes complex conjugation, we have\n\nm̂_k(s) = (F s_k) ∘ (F s_k)* = (F s_k) ∘ (F* s_k),\n\nwhere ∘ indicates the element-wise product.\n\nThe joint distribution in (1) can now be written\n\nP(s, M) ∝ ∏_k exp{ −(1/2σ²) ||(F s_k) ∘ (F* s_k) − m_k||² } ∏_t exp{ −(1/2ρ²) (∑_{r=1}^{R} a_r s_{t−r} − s_t)² }.   (4)\n\nThe factorization of the distribution in (4) can be used to construct the factor graph shown in Fig. 2. For clarity, we have used a 3rd order autoregressive model and a window length of 4. In this graphical model, function nodes are represented by black disks and each function node corresponds to a term in the joint distribution. 
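The identity m̂_k(s) = (F s_k) ∘ (F* s_k) can be checked numerically against a standard FFT; the explicit DFT matrix below is only for illustration, since np.fft.fft computes the same transform.

```python
import numpy as np

n = 8
# Explicit n x n DFT matrix F.
F = np.exp(-2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n)

rng = np.random.default_rng(1)
sk = rng.standard_normal(n)            # one analysis window s_k

# (F s_k) o (F* s_k): element-wise product of the DFT and its conjugate.
m_hat = (F @ sk) * (np.conj(F) @ sk)
# For real s_k this is real and equals the squared magnitude of the DFT.
```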
There is one function node connecting each observed short-time energy spectrum to the set of n time-domain samples from which it was possibly derived, and one function node connecting each time-domain sample to its R predecessors in the autoregressive model.\n\nTaking the logarithm of the joint distribution in (4) and expanding the norm, we obtain\n\nlog P(s, M) ∝ −(1/2σ²) ∑_k ∑_i (∑_{j=1}^{n} ∑_{l=1}^{n} F_ij F*_il s_{kn/2−n/2+j} s_{kn/2−n/2+l} − m_ki)² − (1/2ρ²) ∑_t (∑_{r=1}^{R} a_r s_{t−r} − s_t)².   (5)\n\n[Figure 2 appears here: a chain of sample variables S_1, ..., S_N with autoregressive function nodes g_1, ..., g_{N−2} above and spectrogram function nodes f_1, ..., f_L below, connected to the observed spectra M_1, ..., M_L.]\n\nFigure 2: Factor graph for the model in (4) using a 3rd order autoregressive model, a window length of 4 and an overlap of 2 samples. Function nodes f_i enforce the constraint that the spectrogram of s match the observed spectrogram, and function nodes g_i enforce the constraint due to the AR model.\n\nIn this expression, k indexes frames, i indexes frequency, s_{kn/2−n/2+j} is the jth sample in the kth frame, m_ki is the observed spectral energy at frequency i in frame k, and a_r is the rth autoregressive coefficient. The log-probability is quartic in the unknown speech samples, s_1, ..., s_N.\n\nFor simplicity of presentation above, we implicitly assumed a rectangular window for computing the spectrogram. The extension to other types of windowing functions is straightforward. In the experiments described below, we have used a Hamming window, and adjusted the equations appropriately.\n\n3 Inference Algorithms\n\nThe goal of probabilistic inference is to compute the posterior distribution over speech waveforms and output a typical sample or a mode of the posterior as an estimate of the reconstructed speech signal. 
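To make the objective concrete, the negative of (5) can be evaluated directly from a candidate signal, after which a generic optimizer can minimize it. The code below is a hedged sketch with toy sizes (window length 8, rectangular window, AR term optional), not the authors' implementation.

```python
import numpy as np

def neg_log_posterior(s, M_obs, n=8, a=(), sigma2=1.0, rho2=1.0):
    """Negative of (5): spectrogram mismatch plus optional AR prior penalty.
    M_obs holds one observed energy spectrum per window (windows spaced n//2)."""
    hop = n // 2
    cost = 0.0
    for k, start in enumerate(range(0, len(s) - n + 1, hop)):
        m_hat = np.abs(np.fft.fft(s[start:start + n])) ** 2
        cost += np.sum((m_hat - M_obs[k]) ** 2) / (2 * sigma2)
    a = np.asarray(a)
    R = len(a)
    if R:
        pred = sum(a[r] * s[R - r - 1:len(s) - r - 1] for r in range(R))
        cost += np.sum((pred - s[R:]) ** 2) / (2 * rho2)
    return cost

# The true signal attains zero mismatch on its own spectrogram.
rng = np.random.default_rng(2)
s_true = rng.standard_normal(32)
M_obs = np.array([np.abs(np.fft.fft(s_true[i:i + 8])) ** 2
                  for i in range(0, len(s_true) - 8 + 1, 4)])
```

Passing this objective (with its gradient) to a generic conjugate-gradient routine, e.g. scipy.optimize.minimize with method='CG', would mirror the direct optimization described below.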
To find a mode of the posterior, we have explored the use of iterated conditional modes (ICM) [8], Markov chain Monte Carlo methods [9], variational techniques [10], and direct application of numerical optimization methods for finding the maximum a posteriori speech signal. In this paper, we report results on two of the faster techniques, ICM and direct optimization.\n\nICM operates by iteratively selecting a variable and assigning the MAP estimate to that variable while keeping all other variables fixed. This technique is guaranteed to increase the joint probability of the speech waveform and the observed spectrum at each step. At every stage we set s_t to its most probable value, given the other speech samples and the observed spectrogram:\n\ns_t* = argmax_{s_t} P(s_t | M, s∖s_t) = argmax_{s_t} P(s, M).\n\nThis value can be found by extracting the terms in (5) that depend on s_t and optimizing the resulting quartic equation. To select an initial configuration of s, we applied an inverse Fourier transform to the observed magnitude spectra M, assuming a random phase. As will become evident in the experimental section of this paper, by updating only a single sample at a time, ICM is prone to finding poor local minima.\n\nWe also implemented an inference algorithm that directly searches for a maximum of log P(s, M) w.r.t. s, using conjugate gradients. The same derivatives used to find the ICM updates were used in a conjugate gradient optimizer, which is capable of finding search directions in the vector space of s and jointly adjusting all speech samples simultaneously. We initialized the conjugate gradient optimizer using the same procedure as described above for ICM.\n\n[Figure 3 appears here: three waveforms and their spectrograms, frequency (kHz) against time (seconds).]\n\nFigure 3: Reconstruction results for an utterance from the WSJ database. (left) Original signal and the corresponding spectrogram. (middle) Reconstruction using the algorithm in [1]. The spectrogram of the reconstruction fails to capture the finer details in the original signal. (right) Reconstruction using our algorithm. The spectrogram captures most of the fine details in the original signal.\n\n4 Experiments\n\nWe tested our algorithm using several randomly chosen utterances from the Wall Street Journal corpus and the NIST TIMIT corpus. For all experiments we used a Hamming window of length 256 with an overlap of 128 samples. Where possible, we trained a 12th order AR model of the speaker using an utterance different from the one used to create the spectrogram. For convergence to a good local minimum, it is important to down-weight the contribution of the AR model for the first several iterations of conjugate gradient optimization. In fact, we ran the algorithm without the AR model until convergence and then started the AR model with a weighting factor of 10. This way, the AR model operates on the signal with very little error in the estimated spectrogram.\n\nAlong the frame boundaries, the variance of the prior (AR model) was set to a small value to smooth out spikes that are not very probable a priori. Further, we also tried using a cubic spline smoother along the boundaries as a post-processing step for better sound quality.\n\n4.1 Evaluation\n\nThe quality of sound in the estimated signal is an important factor in determining the effectiveness of the algorithm. To demonstrate the improvement in the perceptual quality of sound we have placed audio files on the web; for demonstrations please visit http://www.psi.toronto.edu/~kannan/spectrogram. Our algorithm consistently outperformed the algorithm proposed in [1], both in terms of sound quality and in matching the observed spectrogram. 
Fig. 3 shows reconstruction results for an utterance from the WSJ data.\n\nAs expected, ICM typically converged to a poor local minimum in a few iterations. In Fig. 4, a plot of the log probability as a function of the number of iterations is shown for ICM and our approach.\n\nAlgorithm / dB gain:\nGriffin and Lim [1]: 4.508\nOur approach (without AR model): 7.900\nOur approach (12th order AR model): 8.172\n\n[Figure 4 appears here: the table above, and a plot of log P against iteration (0 to 100) for ICM and conjugate gradients (CG).]\n\nFigure 4: SNR for different algorithms. Values reported are averages over 12 different utterances. The graph on the right compares the log probability under ICM to our algorithm.\n\nAnalysis of the signal-to-noise ratio of the true and estimated signals can be used to measure the quality of the estimated signal, with high dB gain indicating good reconstruction.\n\nAs the input to our model does not include a phase component, we cannot measure SNR by comparing the recovered signal to any true time-domain signal. Instead, we define the following approximation\n\nSNR* = ∑_u 10 log [ (1/E_u) ∑_w ∑_f |s_{u,w}(f)|² / ∑_w ∑_f ((1/Ê_u)|ŝ_{u,w}(f)| − (1/E_u)|s_{u,w}(f)|)² ]   (6)\n\nwhere E_u = ∑_t s_t² is the total energy in utterance u (and Ê_u is the corresponding energy of the estimated signal). Summations over u, w and f are over all utterances, windows and frequencies respectively.\n\nThe table in Fig. 4 reports dB gain averaged over several utterances for [1] and for our algorithm with and without an AR model. The gains for our algorithm are significantly better than for the algorithm of Griffin and Lim. 
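For reference, the per-utterance term of (6) can be computed as below. Treating the energies E_u and Ê_u as given inputs, and using log base 10 for the dB scale, are our simplifications for this sketch.

```python
import numpy as np

def snr_star_db(true_specs, est_specs, E, E_hat):
    """One utterance's term of (6), in dB. Arguments: arrays of per-window
    magnitude spectra |s_{u,w}(f)| and estimated |s^_{u,w}(f)|, plus the
    total energies E_u and E^_u of the true and estimated signals."""
    num = np.sum(true_specs ** 2) / E
    den = np.sum((est_specs / E_hat - true_specs / E) ** 2)
    return 10 * np.log10(num / den)

# A reconstruction closer to the truth should score a higher SNR*.
rng = np.random.default_rng(3)
S = np.abs(rng.standard_normal((10, 5)))   # toy "true" magnitude spectra
noise = rng.standard_normal((10, 5))
close = snr_star_db(S, S + 0.01 * noise, E=1.0, E_hat=1.0)
far = snr_star_db(S, S + 1.0 * noise, E=1.0, E_hat=1.0)
```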
Moving the summation over w in (6) outside the log produces similar quality estimates.\n\n4.2 Time Scale Modification\n\nAs an example to show the potential utility of spectrogram inversion, we investigated an extremely simple approach to time scale modification of speech signals. Starting from the original signal we form the spectrogram (or else we may start with the spectrogram directly), and upsample or downsample it along the time axis. (For example, to speed up the speech by a factor of two we can discard every second column of the spectrogram.) In spite of the fact that this approach does not use any phase information from the original signal, it produces results with good perceptual sound quality. (Audio demonstrations are available on the web site given earlier.)\n\n5 Variational Inference\n\nThe framework described so far focuses on obtaining fixed-point estimates for the time-domain signal by maximizing the joint log probability of the model in (5). A more ambitious and potentially useful task is to find the posterior probability distribution P(s|M). As exact inference of P(s|M) is intractable, we approximate it using a fully factored distribution Q(s), where\n\nQ(s) = ∏_i q_i(s_i).   (7)\n\nHere we assume q_i(s_i) ~ N(μ_i, η_i). The goal of variational approximation is to infer the parameters {μ_i, η_i} for all i by minimizing the KL divergence between the approximating Q distribution and the true posterior P(s|M). This is equivalent to minimizing\n\nD = ∑_s Q(s) log [Q(s) / P(s, M)] = ∑_s (∏_i q_i(s_i)) log [(∏_i q_i(s_i)) / P(s, M)] = −∑_i H(q_i) − E_Q[log P(s, M)].   (8)\n\nThe entropy term H(q_i) is easy to compute; log P(s, M) is quartic in the random variable s_i, and the second term involves computing the expectation of this quartic with respect to the Q distribution. 
Simplifying and rearranging terms, we get\n\nD = −∑_i H(q_i) + (1/2σ²) ∑_k ∑_i (∑_{j=1}^{n} ∑_{l=1}^{n} F_ij F*_il μ_{kn/2−n/2+j} μ_{kn/2−n/2+l} − m_ki)² + ∑_i η_i² G_i(μ, η).   (9)\n\nHere G_i(μ, η) accounts for uncertainty in s. Estimates with high uncertainty (η) will tend to have very little influence on other estimates during the optimization. Another interesting aspect of this formulation is that by setting η = 0, the first and third terms in (9) vanish and D takes a form similar to (5). In other words, in the absence of uncertainty we are in essence finding fixed-point estimates for s.\n\n6 Conclusion\n\nIn this paper, we have introduced a simple probabilistic model of noisy spectrograms in which the samples of the unknown time-domain signal are represented directly as hidden variables. By using a continuous gradient optimizer on these quantities, we are able to accurately estimate the full speech signal from only the short-time spectral magnitudes taken in overlapping windows. Our algorithm\u2019s reconstructions are substantially better, both in terms of informal perceptual quality and measured signal-to-noise ratio, than those of the standard approach of Griffin and Lim [1]. Furthermore, in our setting, it is easy to incorporate an a priori model of gross speech structure in the form of an AR model, whose influence on the reconstruction is user-tunable. Spectrogram inversion has many potential applications; as an example we have demonstrated an extremely simple but nonetheless effective time scale modification algorithm which subsamples the spectrogram of the original utterance and then inverts it.\n\nIn addition to improved experimental results, our approach highlights two important lessons from the point of view of statistical signal processing algorithms. 
The first is that directly representing quantities of interest and making inferences about them using the machinery of probabilistic inference is a powerful approach that can avoid the pitfalls of less principled iterative algorithms that maintain inconsistent estimates of redundant quantities, such as phase and time-domain signals. The second is that coordinate descent optimization (ICM) does not always yield the best results in problems with highly dependent hidden variables. It is often tacitly assumed in the graphical models community that the more structured an approximation one can make when updating blocks of parameters simultaneously, the better. In other words, practitioners often try to solve for as many variables as possible conditioned on quantities that have just been updated. Our experience with this model has shown that direct continuous optimization using gradient techniques allows all quantities to adjust simultaneously and ultimately finds far superior solutions.\n\nBecause of its probabilistic nature, our model can easily be extended to include other pieces of prior information, or to deal with missing or noisy spectrogram frames. This opens the door to unified phase recovery and denoising algorithms, and to the possibility of performing sophisticated speech separation or denoising inside the pipeline of a standard speech recognition system, in which typically only short-time spectral magnitudes are available.\n\nAcknowledgments\n\nWe thank Carl Rasmussen for his conjugate gradient optimizer. KA, STR and BJF are supported in part by the Natural Sciences and Engineering Research Council of Canada. BJF and STR are supported in part by the Ontario Premier\u2019s Research Excellence Award. STR is supported in part by the Learning Project of IRIS Canada.\n\nReferences\n\n[1] Griffin, D. W. and Lim, J. S. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech and Signal Processing, 32(2), 1984.\n\n[2] Kschischang, F. R., Frey, B. J. and Loeliger, H.-A. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47, 2001.\n\n[3] Fletcher, R. Practical Methods of Optimization. John Wiley & Sons, 1987.\n\n[4] Roucos, S. and Wilgus, A. M. High quality time-scale modification for speech. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, IEEE, 1985, 493-496.\n\n[5] Rabiner, L. and Juang, B. Fundamentals of Speech Recognition. Prentice Hall, 1993.\n\n[6] Saul, L. K., Lee, D. D., Isbell, C. L. and LeCun, Y. Real time voice processing with audiovisual feedback: toward autonomous agents with perfect pitch. In S. Becker, S. Thrun, and K. Obermayer (eds.), Advances in Neural Information Processing Systems 15. MIT Press: Cambridge, MA, 2003.\n\n[7] Wan, E. A. and Nelson, A. T. Removal of noise from speech using the dual EKF algorithm. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, May 1998.\n\n[8] Besag, J. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society B, 48, 259-302, 1986.\n\n[9] Neal, R. M. Probabilistic inference using Markov chain Monte Carlo methods. University of Toronto Technical Report, 1993.\n\n[10] Jordan, M. I., Ghahramani, Z., Jaakkola, T. S. and Saul, L. K. An introduction to variational methods for graphical models. In Learning in Graphical Models, edited by M. I. Jordan, Kluwer Academic Publishers, Norwell, MA, 1998.\n", "award": [], "sourceid": 2358, "authors": [{"given_name": "Kannan", "family_name": "Achan", "institution": null}, {"given_name": "Sam", "family_name": "Roweis", "institution": null}, {"given_name": "Brendan", "family_name": "Frey", "institution": null}]}