{"title": "One Microphone Blind Dereverberation Based on Quasi-periodicity of Speech Signals", "book": "Advances in Neural Information Processing Systems", "page_first": 1417, "page_last": 1424, "abstract": "", "full_text": "One microphone blind dereverberation\n\nbased on quasi-periodicity of speech signals\n\nTomohiro Nakatani, Masato Miyoshi, and Keisuke Kinoshita\n\nSpeech Open Lab., NTT Communication Science Labs., NTT Corporation\n\n2-4, Hikaridai, Seika-cho, Soraku-gun, Kyoto, Japan\n\n{nak,miyo,kinoshita}@cslab.kecl.ntt.co.jp\n\nAbstract\n\nSpeech dereverberation is desirable with a view to achieving, for exam-\nple, robust speech recognition in the real world. However, it is still a chal-\nlenging problem, especially when using a single microphone. Although\nblind equalization techniques have been exploited, they cannot deal with\nspeech signals appropriately because their assumptions are not satis\ufb01ed\nby speech signals. We propose a new dereverberation principle based\non an inherent property of speech signals, namely quasi-periodicity. The\npresent methods learn the dereverberation \ufb01lter from a lot of speech data\nwith no prior knowledge of the data, and can achieve high quality speech\ndereverberation especially when the reverberation time is long.\n\n1\n\nIntroduction\n\nAlthough numerous studies have been undertaken on robust automatic speech recognition\n(ASR) in the real world, long reverberation is still a serious problem that severely degrades\nthe ASR performance [1]. One simple way to overcome this problem is to dereverberate\nthe speech signals prior to ASR, but this is also a challenging problem, especially when\nusing a single microphone. For example, certain blind equalization methods, including\nindependent component analysis (ICA), can estimate the inverse \ufb01lter of an unknown im-\npulse response convolved with target signals when the signals are statistically independent\nand identically distributed sequences [2]. 
However, these methods cannot appropriately deal with speech signals, because speech signals have inherent properties, such as periodicity and formant structure, that make their sequences statistically dependent. This approach inevitably destroys such essential properties of speech. Another approach that uses the properties of speech has also been proposed [3]. The basic idea involves adaptively detecting time regions in which signal-to-reverberation ratios become small, and attenuating speech signals in those regions. However, the precise separation of the signal and reverberation durations is difficult; therefore, this approach has achieved only moderate results so far.

In this paper, we propose a new principle for estimating an inverse filter by using an essential property of speech signals, namely quasi-periodicity, as a clue. In general, voiced segments in an utterance have approximate periodicity in each local time region while the period gradually changes. Therefore, when a long reverberation is added to a speech signal, signals in different time regions with different periods are mixed, thus degrading the periodicity of the signals in local time regions. By contrast, we show that we can estimate an inverse filter for dereverberating a signal by enhancing the periodicity of the signal in each local time region.
The estimated filter can dereverberate both the periodic and non-periodic parts of speech signals with no prior knowledge of the target signals, even though only the periodic parts of the signals are used for the estimation.

2 Quasi-periodicity based dereverberation

We propose two dereverberation methods, referred to as Harmonicity based dEReverBeration (HERB) methods, based on the features of quasi-periodic signals: one based on an Average Transfer Function (ATF) that transforms reverberant signals into quasi-periodic components (ATF-HERB), and the other based on the Minimum Mean Squared Error (MMSE) criterion that evaluates the quasi-periodicity of target signals (MMSE-HERB). First, we briefly explain the features of quasi-periodic signals, and then describe the two methods.

2.1 Features of quasi-periodic signals

When a source signal s(n) is recorded in a reverberant room1, the obtained signal x(n) is represented as x(n) = h(n) ∗ s(n), where h(n) is the impulse response of the room and “∗” is a convolution operation. The goal of the dereverberation is to estimate a dereverberation filter, w(n), for −N < n < N, that dereverberates x(n), and to obtain the dereverberated signal y(n) by:

y(n) = w(n) ∗ x(n) = (w(n) ∗ h(n)) ∗ s(n) = q(n) ∗ s(n), (1)

where q(n) = w(n) ∗ h(n) is referred to as a dereverberated impulse response. Here, we assume s(n) is a quasi-periodic signal2, which has the following features:

1. In each local time region around n0 (n0 − δ < n < n0 + δ for any n0), s(n) is approximately a periodic signal whose period is T(n0).
2. Outside the region (|n′ − n0| > δ), s(n′) is also a periodic signal within its neighboring time region, but often has another period that is different from T(n0).

These features make x(n) a non-periodic signal even within local time regions when h(m) contains non-zero values for |m| > δ. This is because two or more periodic signals, s(n) and s(n − m), that have different periods are added to x(n) with weights of h(0) and h(m). Inversely, the goal of our dereverberation is to estimate w(n) that makes y(n) a periodic signal in each local time region. Once such a filter is obtained, q(m) must have zero values for |m| > δ, and thus reverberant components longer than δ are eliminated from y(n).

An important additional feature of a quasi-periodic signal is that quasi-periodic components in a source signal can be enhanced by an adaptive harmonic filter. An adaptive harmonic filter is a time-varying linear filter that enhances frequency components whose frequencies correspond to multiples of the fundamental frequency (F0) of the target signal, while preserving their phases and amplitudes. The filter values are adaptively modified according to F0. For example, a filter, F(f0(n))[·], can be implemented as follows:

ˆx(n) = F(f0(n))[x(n)], (2)
= Σn0 g2(n − n0) Re{x(n) ∗ (g1(n) Σk exp(j2πkf0(n0)n/fs))}, (3)

where n0 is the center time of each frame, f0(n0) is the fundamental frequency (F0) of the signal at the frame, k is a harmonics index, g1(n) and g2(n) are analysis window functions, and fs is the sampling frequency.

1In this paper, time domain and frequency domain signals are represented by non-capitalized and capitalized symbols, respectively. Arguments “(ω)”, which represent the center frequencies of the discrete Fourier transformation bins, are often omitted from frequency domain signals.
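As a concrete illustration of such an adaptive harmonic filter, the following sketch realizes F(f0(n))[·] with a short-time Fourier transform and a comb-like mask that keeps only bins near multiples of each frame's F0. This is a minimal stand-in for the windowed formulation of eqs. (2)-(3), not the authors' implementation; the function name and the frame, hop, and bandwidth parameters are all assumptions made for the sketch.

```python
import numpy as np

def harmonic_filter(x, f0_track, fs, frame_len=512, hop=128, bw_hz=20.0):
    """Sketch of an adaptive harmonic filter F(f0)[.]: per frame, keep only
    STFT bins within bw_hz of a multiple of that frame's F0 (all parameter
    choices here are hypothetical)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    y = np.zeros(len(x))
    norm = np.zeros(len(x))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    for i in range(n_frames):
        start = i * hop
        spec = np.fft.rfft(x[start:start + frame_len] * win)
        f0 = f0_track[i]
        if f0 > 0:  # voiced frame: comb mask around the harmonics k * f0
            k = np.round(freqs / f0)
            spec = spec * ((k >= 1) & (np.abs(freqs - k * f0) <= bw_hz))
        # weighted overlap-add resynthesis; kept bins preserve phase/amplitude
        y[start:start + frame_len] += np.fft.irfft(spec) * win
        norm[start:start + frame_len] += win ** 2
    return y / np.maximum(norm, 1e-8)
```

With a correct F0 track, a quasi-periodic input passes largely unchanged, while additive components away from the harmonics, such as reverberation whose frequencies no longer match the current F0, are attenuated.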
2Later, this assumption is extended so that s(n) is composed of quasi-periodic components and non-periodic components in the case of speech signals.

Even when x(n) contains a long reverberation, the reverberant components that have different frequencies from s(n) are reduced by the harmonic filter, and thus the quasi-periodic components can be enhanced.

Figure 1: Diagram of ATF-HERB (S(ω) is filtered by H(ω) to give X(ω); the harmonic filter F(f0) yields ˆX(ω); W(ω) = E(ˆX/X) is applied to X(ω) to give Y(ω)).

2.2 ATF-HERB: average transfer function based dereverberation

Figure 1 is a diagram of ATF-HERB, which uses the average transfer function from reverberant signals to quasi-periodic signals. A speech signal, S(ω), can be modeled as the sum of quasi-periodic components, or voiced components, Sh(ω), and non-periodic components, or unvoiced components, Sn(ω), as in eq. (4). The reverberant observed signal, X(ω), is then represented by the product of S and the transfer function, H(ω), of a room, as in eq. (5). The transfer function, H, can also be divided into two functions, D(ω) and R(ω): the former transforms S into the direct signal, DS, and the latter into the reverberation part, RS, as shown in eq. (6). X is also represented by the sum of the direct signal of the quasi-periodic components, DSh, and the other components, as in eq. (7):

S(ω) = Sh(ω) + Sn(ω), (4)
X(ω) = H(ω)S(ω), (5)
= (D(ω) + R(ω))S(ω), (6)
= DSh + (RSh + HSn). (7)

Of these components, DSh can approximately be extracted from X by harmonic filtering. Although the frequencies of quasi-periodic components change dynamically according to the changes in their fundamental frequency (F0), their reverberation remains unchanged at the same frequency. Therefore, direct quasi-periodic components, DSh, can be enhanced by extracting frequency components located at multiples of the F0. This approximated direct signal, ˆX(ω), can be modeled as follows:

ˆX(ω) = D(ω)Sh(ω) + (ˆR(ω)Sh(ω) + ˆN(ω)), (8)

where ˆR(ω)Sh(ω) and ˆN(ω) are the part of the reverberation of Sh, and the part of the direct signal and reverberation of Sn, respectively, that unexpectedly remain in ˆX after the harmonic filtering3. We assume that all the estimation errors in ˆX are caused by ˆRSh and ˆN in eq. (8).

The goal of ATF-HERB is to estimate O(ˆR(ω)) = (D(ω) + ˆR(ω))/H(ω), referred to as a “dereverberation operator.” This is because the signal DS + ˆRS, which can be obtained by multiplying O(ˆR) by X, becomes in a sense a dereverberated signal:

O(ˆR(ω))X(ω) = D(ω)S(ω) + ˆR(ω)S(ω), (9)

where the right side of eq. (9) is composed of a direct signal, DS, and certain parts of the reverberation, ˆRS.

3Strictly speaking, ˆR cannot be represented as a linear transformation because the reverberation included in ˆX depends on the time pattern of ˆX. We introduce this approximation for simplicity.

Figure 2: Diagram of MMSE-HERB (W(ω) is obtained by minimizing the MMSE criterion between Y(ω) = W(ω)X(ω) and the harmonic-filter output ˆX(ω)).
The rest of the reverberation included in X (= DS + RS), or (R − ˆR)S, is eliminated by the dereverberation operator.

To estimate the dereverberation operator, we use the output of the harmonic filter, ˆX. Suppose that a number of X values are obtained and that ˆX values are calculated from the individual X values. Then, the dereverberation operator, O(ˆR), can be approximated as the average of ˆX/X, or W(ω) = E(ˆX/X). W(ω) is shown to be a good estimate of O(ˆR) by substituting eqs. (4), (5), and (8) into E(ˆX/X), as in eq. (11):

W(ω) = E(ˆX/X), (10)
= O(ˆR(ω)) E(1/(1 + Sn/Sh)) + E(1/(1 + (X − ˆN)/ˆN)), (11)
≈ O(ˆR(ω)) P(|Sh(ω)| > |Sn(ω)|), (12)

where P(·) is a probability function. The arguments of the two average functions in eq. (11) have the form of a complex function, f(z) = 1/(1 + z). E(f(z)) is easily proven to equal P(|z| < 1) using the residue theorem, if it is assumed that the phase of z is uniformly distributed, that the phase of z and |z| are independent, and that |z| ≠ 1. Based on this property, the second term of eq. (11) approximately equals zero, because ˆN is a non-periodic component that the harmonic filter unexpectedly extracts, and thus the magnitude of ˆN almost always has a smaller value than that of (X − ˆN) if a sufficiently long analysis window is used. Therefore, W(ω) can be approximated by eq. (12); that is, W(ω) has the value of the dereverberation operator multiplied by the probability that the harmonic components of speech have a larger magnitude than the non-periodic components.
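The key identity used in the step from eq. (11) to eq. (12), E(f(z)) = P(|z| < 1) for f(z) = 1/(1 + z), can be checked numerically. In this sketch the magnitude distribution of z is an arbitrary assumption made for the check; only the uniform, independent phase matters for the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200000

# z with an arbitrary magnitude law and an independent, uniformly
# distributed phase, as assumed below eq. (11)
mag = rng.uniform(0.0, 2.0, n)          # chosen so that P(|z| < 1) = 0.5
phase = rng.uniform(0.0, 2.0 * np.pi, n)
z = mag * np.exp(1j * phase)

est = np.mean(1.0 / (1.0 + z))          # Monte Carlo estimate of E(f(z))
prob = np.mean(mag < 1.0)               # empirical P(|z| < 1)
```

Averaging 1/(1 + z) over the uniform phase gives 1 when |z| < 1 and 0 when |z| > 1 by the residue theorem, so the mean reduces to the probability that |z| < 1; this is why the second term of eq. (11) vanishes when |ˆN| is almost always smaller than |X − ˆN|.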
Once the dereverberation operator is calculated from the periodic parts of speech signals for almost all the frequency range, it can dereverberate both the periodic and non-periodic parts of the signals, because the inverse transfer function is independent of the source signal characteristics. In practice, however, the gain of W(ω) tends to decrease with frequency with our method. This is because the magnitudes of the non-periodic components relative to the periodic components tend to increase with frequency for a speech signal, and thus the P(|Sh| > |Sn|) value becomes smaller as ω increases. To compensate for this decreasing gain, it may be useful to use the average attributes of speech on the probability P(|Sh| > |Sn|). In our experiments in section 4, however, W(ω) itself was used as the dereverberation operator without any compensation.

2.3 MMSE-HERB: minimum mean squared error criterion based dereverberation

As discussed in section 2.1, quasi-periodic signals can be dereverberated simply by enhancing their quasi-periodicity. To implement this principle directly, we introduce a cost function, referred to as the minimum mean squared error (MMSE) criterion, that evaluates the quasi-periodicity of the signals as follows:

C(w) = Σn (y(n) − F(f0(n))[y(n)])² = Σn (w(n) ∗ x(n) − F(f0(n))[w(n) ∗ x(n)])², (13)

where y(n) = w(n) ∗ x(n) is the target signal that should be dereverberated by controlling w(n), and F(f0(n))[y(n)] is the signal obtained by applying a harmonic filter to y(n).
When y(n) is a quasi-periodic signal, y(n) approximately equals F(f0(n))[y(n)] because of the features of quasi-periodic signals, and thus the above cost function is expected to take its minimum value. Inversely, the filter w(n) that minimizes C(w) is expected to enhance the quasi-periodicity of x(n). Such filter parameters can, for example, be obtained with optimization algorithms such as a hill-climbing method using the derivatives of C(w), calculated as follows:

∂C(w)/∂w(l) = 2 Σn (y(n) − F(f0(n))[y(n)])(x(n − l) − F(f0(n))[x(n − l)]), (14)

where F(f0(n))[x(n − l)] is the signal obtained by applying the adaptive harmonic filter to x(n − l)4.

There are, however, several problems involved in directly using eq. (13) as the cost function.

1. As discussed in section 2.1, the values of the dereverberated impulse response, q(n), are expected to become zero with this method where |n| > δ; however, the values are not specifically determined where |n| < δ. This may cause unexpected spectral modification of the dereverberated signal. Additional constraints are required in order to specify these values.
2. The cost function has a self-evident solution, that is, w(l) = 0 for all l values. This solution means that the signal y(n) is always zero instead of being dereverberated, and should therefore be excluded. Some constraints, such as Σl w(l)² = 1, may be useful for solving this problem.
3. The complexity of the computation needed to minimize the cost function based on repetitive estimation increases as the dereverberation filter becomes longer. The longer the reverberation becomes, the longer the dereverberation filter should be.

To overcome these problems, we simplify the cost function in this paper.
The new cost function is defined as follows:

C(W(ω)) = E(|Y(ω) − ˆX(ω)|²) = E(|W(ω)X(ω) − ˆX(ω)|²), (15)

where Y(ω), X(ω), and ˆX(ω) are the discrete Fourier transformations of y(n), x(n), and F(f0(n))[x(n)], respectively. The new cost function evaluates the quasi-periodicity not in the time domain but in the frequency domain, and uses a fixed quasi-periodic signal, ˆX(ω), as the desired signal, instead of the non-fixed quasi-periodic signal F(f0(n))[y(n)]. This modification allows us to solve the above problems. The use of the fixed desired signal specifically provides the dereverberated impulse response, q(n), with the desired values, even in the time region |n| < δ. In addition, the self-evident solution, w(l) = 0, can no longer be optimal in terms of the cost function. Furthermore, the computational complexity is greatly reduced because the solution can be given analytically as follows:

W(ω) = E(ˆX(ω)X∗(ω)) / E(X(ω)X∗(ω)). (16)

A diagram of this simplified MMSE-HERB is shown in Fig. 2.

4F(f0(n))[x(n − l)] is not the same signal as ˆx(n − l): when calculating F(f0(n))[x(n − l)], x(n) is time-shifted by l points while f0(n) of the adaptive harmonic filter is not time-shifted.
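For one frequency bin, eq. (16) is the standard least-squares minimizer of eq. (15), and this can be checked on synthetic frame spectra. In the toy sketch below, ˆX is modeled as a fixed complex gain w_true times X plus an uncorrelated residual; w_true, the noise level, and the Gaussian spectra are all assumptions standing in for O(ˆR) and the error terms of eq. (8):

```python
import numpy as np

rng = np.random.default_rng(1)
n_frames = 50000

# toy per-bin frame spectra: observed X and desired (harmonic-filtered) X_hat
w_true = 0.6 - 0.3j                      # stand-in for the operator O(R_hat)
X = rng.standard_normal(n_frames) + 1j * rng.standard_normal(n_frames)
resid = 0.2 * (rng.standard_normal(n_frames) + 1j * rng.standard_normal(n_frames))
X_hat = w_true * X + resid

def cost(w):
    # empirical version of the MMSE criterion, eq. (15)
    return np.mean(np.abs(w * X - X_hat) ** 2)

# analytic minimizer, eq. (16)
W = np.mean(X_hat * np.conj(X)) / np.mean(X * np.conj(X))
```

W recovers w_true up to sampling error, and no perturbed gain achieves a lower empirical cost, which is what makes the closed form attractive compared with repetitive time-domain optimization.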
Figure 3: Processing flow of dereverberation. In STEP 1, F0 estimation and adaptive harmonic filtering of the input X give ˆX1, the dereverberation operator O(ˆR1) is estimated, and X is dereverberated by O(ˆR1). In STEP 2, the same procedure is applied to O(ˆR1)X, giving ˆX2 and O(ˆR2), and the output is O(ˆR2)O(ˆR1)X.

When we assume the model of ˆX in eq. (8), and E(Sh S∗n) = E(ˆN S∗n) = E(Sn S∗h) = 0, it is shown that the resulting W in eq. (16) again approaches the dereverberation operator, O(ˆR), presented in section 2.2:

W(ω) = O(ˆR(ω)) E(Sh S∗h)/(E(Sh S∗h) + E(Sn S∗n)) + (1/H) E(ˆN S∗h)/(E(Sh S∗h) + E(Sn S∗n)), (17)
≈ O(ˆR(ω)) E(Sh S∗h)/(E(Sh S∗h) + E(Sn S∗n)). (18)

Because ˆN represents non-periodic components that are included unexpectedly and at random in the output of the harmonic filter, the absolute value of the second term in eq. (17) is expected to be sufficiently small compared with that of the first term; therefore, we disregard this term. Then, W(ω) in eq. (16) becomes the dereverberation operator multiplied by the ratio of the expected power of the quasi-periodic components in the signals to that of the whole signals. As with the speech signals discussed in section 2.2, the E(Sh S∗h)/(E(Sh S∗h) + E(Sn S∗n)) value becomes smaller as ω increases, and thus the gain of W(ω) tends to decrease.
Therefore, the same frequency compensation scenario as\nfound in section 2.2 may again be useful for the MMSE based dereverberation scheme.\n\nE( \u02c6N S\n\u2217\nh) + E(SnS\u2217\nn) ,\n\n\u2217\nh)/(E(ShS\n\n\u2217\nh) +E (SnS\n\n3 Processing \ufb02ow\n\nBased on the above two methods, we constructed a dereverberation algorithm composed of\ntwo steps as shown in Fig. 3. Both methods are implemented in the same processing \ufb02ow\nexcept that the methods used to calculate the dereverberation operator are different. The\n\ufb02ow is summarized as follows:\n\n1. In the \ufb01rst step, F0 is estimated from the reverberant signal, X. Then the harmonic\ncomponents included in X are estimated as \u02c6X1 based on adaptive harmonic \ufb01l-\ntering. The dereverberation operator O( \u02c6R1) is then calculated by ATF-HERB or\nMMSE-HERB for a number of reverberant speech signals. Finally, the derever-\nberated signal is obtained by multiplying O( \u02c6R1) by X.\n\n2. The second step employs almost the same procedures as the \ufb01rst step except that\nthe speech data dereverberated by the \ufb01rst step are used as the input signal. The\nuse of this dereverberated input signal means that reverberant components, \u02c6R2X2,\ninevitably included in eq. (8) can be attenuated. Therefore, a more effective dere-\nverberation can be achieved in step 2.\n\nIn our preliminary experiments, however, repeating STEP 2 did not always improve the\nquality of the dereverberated signals. This is because the estimation error of the dereverber-\nation operators accumulates in the dereverberated signals when the signals are multiplied\nby more than one dereverberation operator. Therefore, in our experiments, we used STEP 2\nonly once. 
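The effect of STEP 2 can be seen in a toy, single-bin simulation in which the harmonic-filter stage is replaced by a hypothetical oracle: it returns the direct component D·S plus a fixed fraction ("leak") of whatever reverberant residual is still present in its input. All the constants and the oracle itself are assumptions made for the sketch, not part of the method:

```python
import numpy as np

rng = np.random.default_rng(2)

# toy single-bin model: source frames S, transfer function H, direct part D
H, D = 1.0 + 0.8j, 0.9 + 0.1j
S = rng.standard_normal(20000) + 1j * rng.standard_normal(20000)
X = H * S                                 # reverberant observation

def extract(Z, leak=0.3):
    # hypothetical stand-in for F0 estimation + adaptive harmonic filtering:
    # the direct part plus a fraction of the reverberant residual of the input
    return D * S + leak * (Z - D * S)

# STEP 1: operator from X (average of X_hat / X, as in ATF-HERB)
O1 = np.mean(extract(X) / X)
Y1 = O1 * X

# STEP 2: the same procedure applied to the partially dereverberated Y1
O2 = np.mean(extract(Y1) / Y1)
Y2 = O2 * Y1
```

After STEP 1 the residual reverberant coefficient shrinks by the leak factor, and applying STEP 2 shrinks it again; in the real system, further passes mainly accumulate estimation error, which is why STEP 2 is applied only once.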
A more detailed explanation of these processing steps is also presented in [4].

Figure 4: Reverberation curves of the original impulse responses (thin line) and dereverberated impulse responses (male: thick dashed line, female: thick solid line) for different reverberation times (rtime = 0.1, 0.2, 0.5, and 1.0 sec; power in dB versus time in sec).

Accurate F0 estimation is very important in terms of achieving effective dereverberation with our methods in this processing flow. However, this is a difficult task, especially for speech with a long reverberation, using existing F0 estimators. To cope with this problem, we designed a simple filter that attenuates a signal that continues at the same frequency, and used it as a preprocessor for the F0 estimation [5]. In addition, the dereverberation operator, O(ˆR1), itself is a very effective preprocessor for an F0 estimator, because the reverberation of the speech can be directly reduced by the operator. This mechanism is already included in step 2 of the dereverberation procedure; that is, F0 estimation is applied to O(ˆR1)X. Therefore, more accurate F0 estimates can be obtained in step 2 than in step 1.

4 Experimental results

We examined the performance of the proposed dereverberation methods. Almost the same results were obtained with the two methods, and so we only describe those obtained with ATF-HERB.
We used 5240 Japanese word utterances provided by a male and a female\nspeaker (MAU and FKM, 12 kHz sampling) included in the ATR database as source signals,\nS(\u03c9). We used four impulse responses measured in a reverberant room whose reverberation\ntimes were about 0.1, 0.2, 0.5, and 1.0 sec, respectively. Reverberant signals, X(\u03c9), were\nobtained by convolving S(\u03c9) with the impulse responses.\nFigure 4 depicts the reverberation curves5 of the original impulse responses and the derever-\nberated impulse responses obtained with ATF-HERB. The \ufb01gure shows that the proposed\nmethods could effectively reduce the reverberation in the impulse responses for the female\nspeaker when the reverberation time (rtime) was longer than 0.1 sec. For the male speaker,\nthe reverberation effect in the lower time region was also effectively reduced. This means\nthat strong reverberant components were eliminated, and we can expect the intelligibility\nof the signals to be improved [6].\nFigure 5 shows spectrograms of reverberant and dereverberated speech signals when rtime\nwas 1.0 sec. As shown in the \ufb01gure, the reverberation of the signal was effectively reduced,\nand the formant structure of the signal was restored. Similar spectrogram features were\nobserved under other reverberation conditions, and an improvement in sound quality could\nclearly be recognized by listening to the dereverberated signals [7]. 
We also evaluated the quality of the dereverberated speech in terms of speaker-dependent word recognition rates with an ASR system, and could achieve more than 95% recognition rates under all the reverberation conditions with acoustic models trained using dereverberated speech signals. Detailed information on the ASR experiments is provided in [4].

5The reverberation curve shows the reduction in the energy of a room impulse response with time [6].

Figure 5: Spectrograms of reverberant (left) and dereverberated (right) speech of a male speaker uttering “ba-ku-da-i” (frequency in kHz versus time in sec).

5 Conclusion

A new blind dereverberation principle based on the quasi-periodicity of speech signals was proposed. We presented two types of dereverberation method, referred to as harmonicity based dereverberation (HERB) methods: one estimates the average transfer function that transforms reverberant signals into quasi-periodic signals (ATF-HERB), and the other minimizes the MMSE criterion that evaluates the quasi-periodicity of signals (MMSE-HERB). We showed that ATF-HERB and a simplified version of MMSE-HERB are both capable of learning a dereverberation operator that can reduce reverberant components in speech signals. Experimental results showed that a dereverberation operator trained with 5240 Japanese word utterances could achieve very high quality speech dereverberation. Future work will include an investigation of how such high quality speech dereverberation can be achieved with less speech data.

References

[1] Baba, A., Lee, A., Saruwatari, H., and Shikano, K., “Speech recognition by reverberation adapted acoustic model,” Proc. of ASJ general meeting, pp. 27–28, Akita, Japan, Sep. 2002.

[2] Amari, S., Douglas, S.
C., Cichocki, A., and Yang, H. H., \u201cMultichannel blind decon-\nvolution and equalization using the natural gradient,\u201d Proc. IEEE Workshop on Signal\nProcessing Advances in Wireless Communications, Paris, pp. 101-104, April 1997.\n\n[3] Yegnanarayana, B., and Murthy, P. S., \u201cEnhancement of reverberant speech using LP\n\nresidual signal,\u201d IEEE Trans. SAP vol. 8, no. 3, pp. 267\u2013281, 2000.\n\n[4] Nakatani, T., Miyoshi, M., and Kinoshita, K., \u201cImplementation and effects of single\nchannel dereverberation based on the harmonic structure of speech,\u201d Proc. IWAENC-\n2003, Sep., 2003.\n\n[5] Nakatani, T., and Miyoshi, M., \u201cBlind dereverberation of single channel speech signal\n\nbased on harmonic structure,\u201d Proc. ICASSP-2003, vol. 1, pp. 92\u201395, Apr., 2003.\n\n[6] Yegnanarayana, B., and Ramakrishna, B. S., \u201cIntelligibility of speech under nonex-\n\nponential decay conditions,\u201d JASA, vol. 58, pp. 853\u2013857, Oct. 1975.\n\n[7] http://www.kecl.ntt.co.jp/icl/signal/nakatani/sound-demos/dm/derev-demos.html\n\n\f", "award": [], "sourceid": 2436, "authors": [{"given_name": "Tomohiro", "family_name": "Nakatani", "institution": null}, {"given_name": "Masato", "family_name": "Miyoshi", "institution": null}, {"given_name": "Keisuke", "family_name": "Kinoshita", "institution": null}]}