{"title": "Real Time Voice Processing with Audiovisual Feedback: Toward Autonomous Agents with Perfect Pitch", "book": "Advances in Neural Information Processing Systems", "page_first": 1205, "page_last": 1212, "abstract": null, "full_text": "Real Time Voice Processing with Audiovisual\n\nFeedback: Toward Autonomous Agents\n\nwith Perfect Pitch\n\nLawrence K. Saul1, Daniel D. Lee2, Charles L. Isbell3, and Yann LeCun4\n\n1 Department of Computer and Information Science\n2Department of Electrical and System Engineering\n\nUniversity of Pennsylvania, 200 South 33rd St, Philadelphia, PA 19104\n\n3Georgia Tech College of Computing, 801 Atlantic Drive, Atlanta, GA 30332\n\n4NEC Research Institute, 4 Independence Way, Princeton, NJ 08540\n\nlsaul@cis.upenn.edu, ddlee@ee.upenn.edu, isbell@cc.gatech.edu, yann@research.nj.nec.com\n\nAbstract\n\nWe have implemented a real time front end for detecting voiced speech\nand estimating its fundamental frequency. The front end performs the\nsignal processing for voice-driven agents that attend to the pitch contours\nof human speech and provide continuous audiovisual feedback. The al-\ngorithm we use for pitch tracking has several distinguishing features: it\nmakes no use of FFTs or autocorrelation at the pitch period; it updates the\npitch incrementally on a sample-by-sample basis; it avoids peak picking\nand does not require interpolation in time or frequency to obtain high res-\nolution estimates; and it works reliably over a four octave range, in real\ntime, without the need for postprocessing to produce smooth contours.\nThe algorithm is based on two simple ideas in neural computation: the\nintroduction of a purposeful nonlinearity, and the error signal of a least\nsquares \ufb01t. The pitch tracker is used in two real time multimedia applica-\ntions: a voice-to-MIDI player that synthesizes electronic music from vo-\ncalized melodies, and an audiovisual Karaoke machine with multimodal\nfeedback. 
Both applications run on a laptop and display the user's pitch scrolling across the screen as he or she sings into the computer.

1 Introduction

The pitch of the human voice is one of its most easily and rapidly controlled acoustic attributes. It plays a central role in both the production and perception of speech[17]. In clean speech, and even in corrupted speech, pitch is generally perceived with great accuracy[2, 6] at the fundamental frequency characterizing the vibration of the speaker's vocal cords.

There is a large literature on machine algorithms for pitch tracking[7], as well as applications to speech synthesis, coding, and recognition. Most algorithms have one or more of the following components. First, sliding windows of speech are analyzed at 5-10 ms intervals, and the results concatenated over time to obtain an initial estimate of the pitch contour. Second, within each window (30-60 ms), the pitch is deduced from peaks in the windowed autocorrelation function[13] or power spectrum[9, 10, 15], then refined by further interpolation in time or frequency. Third, the estimated pitch contours are smoothed by a postprocessing procedure[16], such as dynamic programming or median filtering, to remove octave errors and isolated glitches.

In this paper, we describe an algorithm for pitch tracking that works quite differently and, based on our experience, quite well as a real time front end for interactive voice-driven agents. Notably, our algorithm does not make use of FFTs or autocorrelation at the pitch period; it updates the pitch incrementally on a sample-by-sample basis; it avoids peak picking and does not require interpolation in time or frequency to obtain high resolution estimates; and it works reliably over a four octave range, in real time, without any postprocessing.
We have implemented the algorithm in two real-time multimedia applications: a voice-to-MIDI player and an audiovisual Karaoke machine. More generally, we are using the algorithm to explore novel types of human-computer interaction, as well as studying extensions of the algorithm for handling corrupted speech and overlapping speakers.

2 Algorithm

A pitch tracker performs two essential functions: it labels speech as voiced or unvoiced, and throughout segments of voiced speech, it computes a running estimate of the fundamental frequency. Pitch tracking thus depends on the running detection and identification of periodic signals in speech. We develop our algorithm for pitch tracking by first examining the simpler problem of detecting sinusoids. For this simpler problem, we describe a solution that does not involve FFTs or autocorrelation at the period of the sinusoid. We then extend this solution to the more general problem of detecting periodic signals in speech.

2.1 Detecting sinusoids

A simple approach to detecting sinusoids is based on viewing them as the solution of a second order linear difference equation[12]. A discretely sampled sinusoid has the form:

  s_n = A \sin(\omega n + \theta).   (1)

Sinusoids obey a simple difference equation such that each sample s_n is proportional to the average of its neighbors, (s_{n-1} + s_{n+1})/2, with the constant of proportionality given by:

  s_n = (\cos\omega)^{-1} \left[ \frac{s_{n-1} + s_{n+1}}{2} \right].   (2)

Eq. (2) can be proved using trigonometric identities to expand the terms on the right hand side. We can use this property to judge whether an unknown signal x_n is approximately sinusoidal.
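The identity in eq. (2) is easy to verify numerically; a minimal sketch in Python (numpy assumed available; the frequency, amplitude, and phase below are arbitrary illustrative values):

```python
import numpy as np

# Numerical check of eq. (2): each interior sample of a sinusoid equals
# (cos w)^(-1) times the average of its two neighbors.
w, A, theta = 0.3, 1.5, 0.7                  # arbitrary frequency, amplitude, phase
s = A * np.sin(w * np.arange(1000) + theta)  # discretely sampled sinusoid, eq. (1)
n = np.arange(1, 999)                        # interior sample indices
lhs = s[n]
rhs = (s[n - 1] + s[n + 1]) / (2 * np.cos(w))
assert np.allclose(lhs, rhs)
```

The check holds for any choice of A, omega, and theta, which is what licenses fitting the single coefficient in the next step.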
Consider the error function:

  E(\alpha) = \sum_n \left[ x_n - \alpha \left( \frac{x_{n-1} + x_{n+1}}{2} \right) \right]^2.   (3)

If the signal x_n is well described by a sinusoid, then the right hand side of this error function will achieve a small value when the coefficient \alpha is tuned to match its frequency, as in eq. (2). The minimum of the error function is found by solving a least squares problem:

  \alpha^* = \frac{2 \sum_n x_n (x_{n-1} + x_{n+1})}{\sum_n (x_{n-1} + x_{n+1})^2}.   (4)

Thus, to test whether a signal x_n is sinusoidal, we can minimize its error function by eq. (4), then check two conditions: first, that E(\alpha^*) \ll E(0), and second, that |\alpha^*| \geq 1. The first condition establishes that the mean squared error is small relative to the mean squared amplitude of the signal, while the second establishes that the signal is sinusoidal (as opposed to exponential), with estimated frequency:

  \omega^* = \cos^{-1}(1/\alpha^*).   (5)

This procedure for detecting sinusoids (known as Prony's method[12]) has several notable features. First, it does not rely on computing FFTs or autocorrelation at the period of the sinusoid, but only on computing the zero-lagged and one-sample-lagged autocorrelations that appear in eq. (4), namely \sum_n x_n^2 and \sum_n x_n x_{n \pm 1}. Second, the frequency estimates are obtained from the solution of a least squares problem, as opposed to the peaks of an autocorrelation or FFT, where the resolution may be limited by the sampling rate or signal length. Third, the method can be used in an incremental way to track the frequency of a slowly modulated sinusoid. In particular, suppose we analyze sliding windows, shifted by just one sample at a time, of a longer, nonstationary signal. Then we can efficiently update the windowed autocorrelations that appear in eq.
(4) by adding just those terms generated by the rightmost sample of the current window and dropping just those terms generated by the leftmost sample of the previous window. (The number of operations per update is constant and does not depend on the window size.)

We can extract more information from the least squares fit besides the error in eq. (3) and the estimate in eq. (5). In particular, we can characterize the uncertainty in the estimated frequency. The normalized error function N(\alpha) = \log[E(\alpha)/E(0)] evaluates the least squares fit on a dimensionless logarithmic scale that does not depend on the amplitude of the signal. Let \mu = \log(\cos^{-1}(1/\alpha)) denote the log-frequency implied by the coefficient \alpha, and let \Delta\mu^* denote the uncertainty in the estimated log-frequency \mu^* = \log \omega^*. (By working in the log domain, we measure uncertainty in the same units as the distance between notes on the musical scale.) A heuristic measure of uncertainty is obtained by evaluating the sharpness of the least squares fit, as characterized by the second derivative:

  \Delta\mu^* = \left[ \left. \frac{\partial^2 N}{\partial\mu^2} \right|_{\mu=\mu^*} \right]^{-1/2}
              = \frac{1}{\omega^*} \left( \frac{\cos^2 \omega^*}{\sin \omega^*} \right)
                \left[ \left. \frac{1}{E} \frac{\partial^2 E}{\partial\alpha^2} \right|_{\alpha=\alpha^*} \right]^{-1/2}.   (6)

Eq. (6) relates sharper fits to lower uncertainty, or higher precision. As we shall see, it provides a valuable criterion for comparing the results of different least squares fits.

2.2 Detecting voiced speech

Our algorithm for detecting voiced speech is a simple extension of the algorithm described in the previous section. The algorithm operates on the time domain waveform in a number of stages, as summarized in Fig. 1.
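The least squares fit and the two-condition test of section 2.1 can be sketched in a few lines. This is a batch (non-incremental) sketch; the acceptance threshold `ratio` is an arbitrary illustrative choice, not the uncertainty criterion of eq. (6):

```python
import numpy as np

def prony_fit(x):
    """Eqs. (3)-(5): fit alpha by least squares; return (alpha, E(alpha*), E(0))."""
    y = (x[:-2] + x[2:]) / 2.0               # neighbor averages
    c = x[1:-1]                              # center samples
    alpha = np.dot(c, y) / np.dot(y, y)      # eq. (4)
    E_alpha = np.sum((c - alpha * y) ** 2)   # eq. (3) at the minimum
    E_zero = np.sum(c ** 2)                  # E(0)
    return alpha, E_alpha, E_zero

def detect_sinusoid(x, ratio=1e-2):
    """Check E(alpha*) << E(0) and |alpha*| >= 1; return omega* or None."""
    alpha, E_a, E_0 = prony_fit(x)
    if E_a < ratio * E_0 and abs(alpha) >= 1.0:
        return np.arccos(1.0 / alpha)        # eq. (5)
    return None

w = 0.42
x = np.sin(w * np.arange(400) + 0.3)
est = detect_sinusoid(x)
assert est is not None and abs(est - w) < 1e-6       # sinusoid detected
assert detect_sinusoid(np.random.RandomState(0).randn(400)) is None  # noise rejected
```

An incremental variant keeps the running sums behind eq. (4) and updates each in constant time as the window slides by one sample.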
The analysis is based on the assumption that the low frequency spectrum of voiced speech can be modeled as a sum of (noisy) sinusoids occurring at integer multiples of the fundamental frequency, f0.

Stage 1. Lowpass filtering
The first stage of the algorithm is to lowpass filter the speech, removing energy at frequencies above 1 kHz. This is done to eliminate the aperiodic component of voiced fricatives[17], such as /z/. The signal can be aggressively downsampled after lowpass filtering, though the sampling rate should remain at least twice the maximum allowed value of f0. The lower sampling rate determines the rate at which the estimates of f0 are updated, but it does not limit the resolution of the estimates themselves. (In our formal evaluations of the algorithm, we downsampled from 20 kHz to 4 kHz after lowpass filtering; in the real-time multimedia applications, we downsampled from 44.1 kHz to 3675 Hz.)

Stage 2. Pointwise nonlinearity
The second stage of the algorithm is to pass the signal through a pointwise nonlinearity, such as squaring or half-wave rectification (which clips negative samples to zero).

[Figure 1 here: block diagram of the front end: lowpass filter, pointwise nonlinearity, bandpass filterbank (25-100, 50-200, 100-400, and 200-800 Hz), per-filter sinusoid detectors, and selection of the sharpest estimate as the pitch.]

Figure 1: Estimating the fundamental frequency f0 of voiced speech without FFTs or autocorrelation at the pitch period. The speech is lowpass filtered (and optionally downsampled) to remove fricative noise, then transformed by a pointwise nonlinearity that concentrates additional energy at f0.
The resulting signal is analyzed by a bank of bandpass filters that are narrow enough to resolve the harmonic at f0, but too wide to resolve higher-order harmonics. A resolved harmonic at f0 (essentially, a sinusoid) is detected by a running least squares fit, and its frequency recovered as the pitch. If more than one sinusoid is detected at the outputs of the filterbank, the one with the sharpest fit is used to estimate the pitch; if no sinusoid is detected, the speech is labeled as unvoiced. (The two octave filterbank in the figure is an idealization. In practice, a larger bank of narrower filters is used.)

The purpose of the nonlinearity is to concentrate additional energy at the fundamental, particularly if such energy was missing or only weakly present in the original signal. In voiced speech, pointwise nonlinearities such as squaring or half-wave rectification tend to create energy at f0 by virtue of extracting a crude representation of the signal's envelope. This is particularly easy to see for the operation of squaring, which, applied to the sum of two sinusoids, creates energy at their sum and difference frequencies, the latter of which characterizes the envelope. In practice, we use half-wave rectification as the nonlinearity in this stage of the algorithm; though less easily characterized than squaring, it has the advantage of preserving the dynamic range of the original signal.

Stage 3. Filterbank
The third stage of the algorithm is to analyze the transformed speech by a bank of bandpass filters. These filters are designed to satisfy two competing criteria. On one hand, they are sufficiently narrow to resolve the harmonic at f0; on the other hand, they are sufficiently wide to integrate higher-order harmonics. An idealized two octave filterbank that meets these criteria is shown in Fig. 1.
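The effect of the Stage 2 nonlinearity is easy to demonstrate on a "missing fundamental" signal. The sketch below (numpy assumed; the sampling rate and f0 are illustrative values matching the running example) measures energy at f0 by projecting onto a sinusoid at f0, with no FFT, before and after half-wave rectification:

```python
import numpy as np

# A "missing fundamental" signal: harmonics at 2*f0 and 3*f0 only.
fs, f0 = 4000.0, 180.0
t = np.arange(4000) / fs                       # exactly one second of signal
x = np.sin(2*np.pi*2*f0*t) + np.sin(2*np.pi*3*f0*t)

def power_at(sig, f, fs):
    """Power of the projection of sig onto a sinusoid at frequency f."""
    n = np.arange(len(sig)) / fs
    c, s = np.cos(2*np.pi*f*n), np.sin(2*np.pi*f*n)
    return (np.dot(sig, c)**2 + np.dot(sig, s)**2) / len(sig)**2

rectified = np.maximum(x, 0.0)                 # half-wave rectification
assert power_at(x, f0, fs) < 1e-6              # no energy at f0 before...
assert power_at(rectified, f0, fs) > 1e-3      # ...substantial energy at f0 after
```

The difference frequency 3*f0 - 2*f0 = f0 lands energy at the fundamental, which the filterbank can then resolve.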
The result of this analysis, for voiced speech, is that the output of the filterbank consists either of sinusoids at f0 (and not any other frequency), or signals that do not resemble sinusoids at all. Consider, for example, a segment of voiced speech with fundamental frequency f0 = 180 Hz. For such speech, only the second filter from 50-200 Hz will resolve the harmonic at 180 Hz. On the other hand, the first filter from 25-100 Hz will pass low frequency noise; the third filter from 100-400 Hz will pass the first and second harmonics at 180 Hz and 360 Hz; and the fourth filter from 200-800 Hz will pass the second through fourth harmonics at 360, 540, and 720 Hz. Thus, the output of the filterbank will consist of a sinusoid at f0 and three other signals that are random or periodic, but definitely not sinusoidal. In practice, we do not use the idealized two octave filterbank shown in Fig. 1, but a larger bank of narrower filters that helps to avoid contaminating the harmonic at f0 by energy at 2f0. The bandpass filters in our experiments were 8th order Chebyshev (type I) filters with 0.5 dB of ripple in 1.6 octave passbands, and signals were doubly filtered to obtain sharp frequency cutoffs.

Stage 4. Sinusoid detection
The fourth stage of the algorithm is to detect sinusoids at the outputs of the filterbank. Sinusoids are detected by the adaptive least squares fits described in section 2.1. Running estimates of sinusoid frequencies and their uncertainties are obtained from eqs. (5-6) and updated on a sample by sample basis for the output of each filter. If the uncertainty in any filter's estimate is less than a specified threshold, then the corresponding sample is labeled as voiced, and the fundamental frequency f0 is determined by whichever filter's estimate has the least uncertainty.
(For sliding windows of length 40–60 ms, the thresholds typically fall in the range 0.08–0.12, with higher thresholds required for shorter windows.) Empirically, we have found the uncertainty in eq. (6) to be a better criterion than the error function itself for evaluating and comparing the least squares fits from different filters. A possible explanation for this is that the expression in eq. (6) was derived by a dimensional analysis, whereas the error functions of different filters are not even computed on the same signals.

Overall, the four stages of the algorithm are well suited to a real time implementation. The algorithm can also be used for batch processing of waveforms, in which case startup and ending transients can be minimized by zero-phase forward and reverse filtering.

3 Evaluation

The algorithm was evaluated on a small database of speech collected at the University of Edinburgh[1]. The Edinburgh database contains about 5 minutes of speech consisting of 50 sentences read by one male speaker and one female speaker. The database also contains reference f0 contours derived from simultaneously recorded laryngograph signals. The sentences in the database are biased to contain difficult cases for f0 estimation, such as voiced fricatives, nasals, liquids, and glides. The results of our algorithm on the first three utterances of each speaker are shown in Fig. 2.

A formal evaluation was made by accumulating errors over all utterances in the database, using the reference f0 contours as ground truth[1]. Comparisons between estimated and reference f0 values were made every 6.4 ms, as in previous benchmarks. Also, in these evaluations, the estimates of f0 from eqs. (4–5) were confined to the range 50–250 Hz for the male speaker and the range 120–400 Hz for the female speaker; this was done for consistency with previous benchmarks, which enforced these limits.
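The frame-by-frame comparison of estimated and reference contours can be sketched as follows; this is an illustrative scoring routine (numpy assumed, with 0 Hz standing for unvoiced frames and a hypothetical toy contour), not the benchmark code itself:

```python
import numpy as np

def score(ref, est, gross=0.20):
    """Frame-level scoring: voicing confusions, gross pitch errors (>20%),
    and rms deviation over correctly voiced, non-gross frames.
    ref, est: f0 in Hz, one value per frame; 0.0 marks an unvoiced frame."""
    ref, est = np.asarray(ref, float), np.asarray(est, float)
    uv_as_v = np.mean(est[ref == 0] > 0)       # unvoiced frames called voiced
    v_as_uv = np.mean(est[ref > 0] == 0)       # voiced frames called unvoiced
    both = (ref > 0) & (est > 0)               # correctly identified as voiced
    rel = np.abs(est[both] - ref[both]) / ref[both]
    gross_rate = np.mean(rel > gross)          # e.g. octave errors
    ok = both.copy(); ok[both] = rel <= gross  # voiced, not in gross error
    rms = np.sqrt(np.mean((est[ok] - ref[ok]) ** 2))
    return uv_as_v, v_as_uv, gross_rate, rms

# Toy contours: one unvoiced frame called voiced, one octave error.
ref = np.array([0.0, 120.0, 125.0, 130.0, 0.0])
est = np.array([0.0, 118.0, 250.0, 131.0, 110.0])
uv, vu, g, rms = score(ref, est)
```

On the toy contours, half the unvoiced frames are misclassified, one of the three voiced frames is in gross error, and the rms deviation is computed only from the two remaining frames.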
Note that our estimated f0 contours were not postprocessed by a smoothing procedure, such as median filtering or dynamic programming.

Error rates were computed for the fraction of unvoiced (or silent) speech misclassified as voiced and for the fraction of voiced speech misclassified as unvoiced. Additionally, for the fraction of speech correctly identified as voiced, a gross error rate was computed measuring the percentage of comparisons for which the reference and estimated f0 differed by more than 20%. Finally, for the fraction of speech correctly identified as voiced and in which the estimated f0 was not in gross error, a root mean square (rms) deviation was computed between the reference and estimated f0.

The original study on this database published results for a number of approaches to pitch tracking. Earlier results, as well as those derived from the algorithm in this paper, are shown in Table 1. The overall results show our algorithm, denoted the adaptive least squares (ALS) approach to pitch tracking, to be extremely competitive in all respects. The only anomaly in these results is the slightly larger rms deviation produced by ALS estimation compared to other approaches. The discrepancy could be an artifact of the filtering operations in Fig. 1, resulting in a slight desynchronization of the reference and estimated f0 contours.
On the other hand, the discrepancy could indicate that for certain voiced sounds, a more robust estimation procedure[12] would yield better results than the simple least squares fits in section 2.1.

[Figure 2 here: six panels of reference and estimated pitch contours (pitch in Hz versus time in sec) for the utterances "Where can I park my car?", "I'd like to leave this in your safe.", and "How much are my telephone charges?".]

Figure 2: Reference and estimated f0 contours for the first three utterances of the male (left) and female (right) speaker in the Edinburgh database[1]. Mismatches between the contours reveal voiced and unvoiced errors.

4 Agents

We have implemented our pitch tracking algorithm as a real time front end for two interactive voice-driven agents. The first is a voice-to-MIDI player that synthesizes electronic music from vocalized melodies[4]. Over one hundred electronic instruments are available. The second (see the storyboard in Fig. 3) is a multimedia Karaoke machine with audiovisual feedback, voice-driven key selection, and performance scoring.
In both applications, the user's pitch is displayed in real time, scrolling across the screen as he or she sings into the computer. In the Karaoke demo, the correct pitch is also simultaneously displayed, providing an additional element of embarrassment when the singer misses a note. Both applications run on a laptop with an external microphone.

Interestingly, the real time audiovisual feedback provided by these agents creates a profoundly different user experience than current systems in automatic speech recognition[14]. Unlike dictation programs or dialog managers, our more primitive agents, which attend only to pitch contours, are not designed to replace human operators, but to entertain and amuse in a way that humans cannot. The effect is to enhance the medium of voice, as opposed to highlighting the gap between human and machine performance.

Male speech:
algorithm   unvoiced in error (%)   voiced in error (%)   gross low (%)   gross high (%)   rms deviation (Hz)
CPD         18.11                   19.89                  0.64            4.09             3.60
FBPT         3.73                   13.90                  0.64            1.27             2.89
HPS         14.11                    7.07                  5.34           28.15             3.21
IPTA         9.78                   17.45                  0.83            1.40             3.37
PP           7.69                   15.82                  1.74            0.22             3.01
SRPD         4.05                   15.78                  2.01            0.62             2.46
eSRPD        4.63                   12.07                  0.90            0.56             1.74
ALS          4.20                   11.00                  0.20            0.05             3.24

Female speech:
algorithm   unvoiced in error (%)   voiced in error (%)   gross low (%)   gross high (%)   rms deviation (Hz)
CPD         31.53                   22.22                  3.97            0.61             7.61
FBPT         3.61                   12.16                  3.55            0.60             7.03
HPS         19.10                   21.06                  1.61            0.46             5.31
IPTA         5.70                   15.93                  0.53            3.12             5.35
PP           6.15                   13.01                  0.26            3.20             6.45
SRPD         2.35                   12.16                  5.56            0.39             5.51
eSRPD        2.73                    9.13                  0.23            0.43             5.13
ALS          4.92                    5.58                  0.04            0.33             6.91

Table 1: Evaluations of different pitch tracking algorithms on male speech (top) and female speech (bottom).
The algorithms in the table are cepstrum pitch determination (CPD)[9], feature-based pitch tracking (FBPT)[11], harmonic product spectrum (HPS) pitch determination[10, 15], parallel processing (PP) of multiple estimators in the time domain[5], integrated pitch tracking (IPTA)[16], super resolution pitch determination (SRPD)[8], enhanced SRPD (eSRPD)[1], and adaptive least squares (ALS) estimation, as described in this paper. The benchmarks other than ALS were previously reported[1]. The best results in each column are indicated in boldface.

Figure 3: Screen shots from the multimedia Karaoke machine with voice-driven key selection, audiovisual feedback, and performance scoring. From left to right: splash screen; singing "happy birthday"; machine evaluation.

5 Future work

Voice is the most natural and expressive medium of human communication. Tapping the full potential of this medium remains a grand challenge for researchers in artificial intelligence (AI) and human-computer interaction. In most situations, a speaker's intentions are derived not only from the literal transcription of his speech, but also from prosodic cues, such as pitch, stress, and rhythm. The real time processing of such cues thus represents a fundamental challenge for autonomous, voice-driven agents. Indeed, a machine that could learn from speech as naturally as a newborn infant, responding to prosodic cues while recognizing no words at all, would constitute a genuine triumph of AI.

We are pursuing the ideas in this paper with this vision in mind, looking beyond the immediate applications to voice-to-MIDI synthesis and audiovisual Karaoke. The algorithm in this paper was purposefully limited to clean speech from non-overlapping speakers.
While the algorithm works well in this domain, we view it mainly as a vehicle for experimenting with non-traditional methods that avoid FFTs and autocorrelation and that (ultimately) might be applied to more complicated signals. We have two main goals for future work: first, to add more sophisticated types of human-computer interaction to our voice-driven agents, and second, to incorporate the novel elements of our pitch tracker into a more comprehensive front end for auditory scene analysis[2, 3]. The agents need to be sufficiently complex to engage humans in extended interactions, as well as sufficiently robust to handle corrupted speech and overlapping speakers. From such agents, we expect interesting possibilities to emerge.

References
[1] P. C. Bagshaw, S. M. Hiller, and M. A. Jack. Enhanced pitch tracking and the processing of f0 contours for computer aided intonation teaching. In Proceedings of the 3rd European Conference on Speech Communication and Technology, volume 2, pages 1003–1006, 1993.
[2] A. S. Bregman. Auditory Scene Analysis: The Perceptual Organization of Sound. M.I.T. Press, Cambridge, MA, 1994.
[3] M. Cooke and D. P. W. Ellis. The auditory organization of speech and other sources in listeners and computational models. Speech Communication, 35:141–177, 2001.
[4] P. de la Cuadra, A. Master, and C. Sapp. Efficient pitch detection techniques for interactive music. In Proceedings of the 2001 International Computer Music Conference, La Habana, Cuba, September 2001.
[5] B. Gold and L. R. Rabiner. Parallel processing techniques for estimating pitch periods of speech in the time domain. Journal of the Acoustical Society of America, 46(2,2):442–448, August 1969.
[6] W. M. Hartmann. Pitch, periodicity, and auditory organization. Journal of the Acoustical Society of America, 100(6):3491–3502, 1996.
[7] W. Hess.
Pitch Determination of Speech Signals: Algorithms and Devices. Springer, 1983.
[8] Y. Medan, E. Yair, and D. Chazan. Super resolution pitch determination of speech signals. IEEE Transactions on Signal Processing, 39(1):40–48, 1991.
[9] A. M. Noll. Cepstrum pitch determination. Journal of the Acoustical Society of America, 41(2):293–309, 1967.
[10] A. M. Noll. Pitch determination of human speech by the harmonic product spectrum, the harmonic sum spectrum, and a maximum likelihood estimate. In Proceedings of the Symposium on Computer Processing in Communication, pages 779–798, April 1969.
[11] M. S. Phillips. A feature-based time domain pitch tracker. Journal of the Acoustical Society of America, 79:S9–S10, 1985.
[12] J. G. Proakis, C. M. Rader, F. Ling, M. Moonen, I. K. Proudler, and C. L. Nikias. Algorithms for Statistical Signal Processing. Prentice Hall, 2002.
[13] L. R. Rabiner. On the use of autocorrelation analysis for pitch determination. IEEE Transactions on Acoustics, Speech, and Signal Processing, 25:22–33, 1977.
[14] L. R. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ, 1993.
[15] M. R. Schroeder. Period histogram and product spectrum: new methods for fundamental frequency measurement. Journal of the Acoustical Society of America, 43(4):829–834, 1968.
[16] B. G. Secrest and G. R. Doddington. An integrated pitch tracking algorithm for speech systems. In Proceedings of the 1983 IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 1352–1355, Boston, 1983.
[17] K. Stevens. Acoustic Phonetics. M.I.T.
Press, Cambridge, MA, 1999.\n\n\f", "award": [], "sourceid": 2289, "authors": [{"given_name": "Lawrence", "family_name": "Saul", "institution": null}, {"given_name": "Daniel", "family_name": "Lee", "institution": null}, {"given_name": "Charles", "family_name": "Isbell", "institution": null}, {"given_name": "Yann", "family_name": "Cun", "institution": null}]}