{"title": "Real-Time Pitch Determination of One or More Voices by Nonnegative Matrix Factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 1233, "page_last": 1240, "abstract": null, "full_text": " Real-Time Pitch Determination of One or More Voices by Nonnegative Matrix Factorization

Fei Sha and Lawrence K. Saul
Dept. of Computer and Information Science
University of Pennsylvania, Philadelphia, PA 19104
{feisha,lsaul}@cis.upenn.edu

Abstract

An auditory "scene", composed of overlapping acoustic sources, can be viewed as a complex object whose constituent parts are the individual sources. Pitch is known to be an important cue for auditory scene analysis. In this paper, with the goal of building agents that operate in human environments, we describe a real-time system to identify the presence of one or more voices and compute their pitch. The signal processing in the front end is based on instantaneous frequency estimation, a method for tracking the partials of voiced speech, while the pattern-matching in the back end is based on nonnegative matrix factorization, an unsupervised algorithm for learning the parts of complex objects. While supporting a framework to analyze complicated auditory scenes, our system maintains real-time operability and state-of-the-art performance in clean speech.

1 Introduction

Nonnegative matrix factorization (NMF) is an unsupervised algorithm for learning the parts of complex objects [11]. The algorithm represents high-dimensional inputs ("objects") by a linear superposition of basis functions ("parts") in which both the linear coefficients and basis functions are constrained to be nonnegative. Applied to images of faces, NMF learns basis functions that correspond to eyes, noses, and mouths; applied to handwritten digits, it learns basis functions that correspond to cursive strokes.
The algorithm has also been implemented in real-time embedded systems as part of a visual front end [10].

Recently, it has been suggested that NMF can play a similarly useful role in speech and audio processing [16, 17]. An auditory "scene", composed of overlapping acoustic sources, can be viewed as a complex object whose constituent parts are the individual sources. Pitch is known to be an extremely important cue for source separation and auditory scene analysis [4]. It is also an acoustic cue that seems amenable to modeling by NMF. In particular, we can imagine the basis functions in NMF as harmonic stacks of individual periodic sources (e.g., voices, instruments), which are superposed to give the magnitude spectrum of a mixed signal. The pattern-matching computations of NMF are reminiscent of long-standing template-based models of pitch perception [6].

Our interest in NMF lies mainly in its use for speech processing. In this paper, we describe a real-time system to detect the presence of one or more voices and determine their pitch. Learning plays a crucial role in our system: the basis functions of NMF are trained offline from data to model the particular timbres of voiced speech, which vary across different phonetic contexts and speakers. In related work, Smaragdis and Brown used NMF to model polyphonic piano music [17]. Our work differs in its focus on speech, real-time processing, and statistical learning of basis functions.

A long-term goal is to develop interactive voice-driven agents that respond to the pitch contours of human speech [15].
To be truly interactive, these agents must be able to process input from distant sources and to operate in noisy environments with overlapping speakers. In this paper, we have taken an important step toward this goal by maintaining real-time operability and state-of-the-art performance in clean speech while developing a framework that can analyze more complicated auditory scenes. These are inherently competing goals in engineering. Our focus on actual system-building also distinguishes our work from many other studies of overlapping periodic sources [5, 9, 19, 20, 21].

The organization of this paper is as follows. In section 2, we describe the signal processing in our front end that converts speech signals into a form that can be analyzed by NMF. In section 3, we describe the use of NMF for pitch tracking--namely, the learning of basis functions for voiced speech, and the nonnegative deconvolution for real-time analysis. In section 4, we present experimental results on signals with one or more voices. Finally, in section 5, we conclude with plans for future work.

2 Signal processing

A periodic signal is characterized by its fundamental frequency, f0. It can be decomposed by Fourier analysis as the sum of sinusoids--or partials--whose frequencies occur at integer multiples of f0. For periodic signals with unknown f0, the frequencies of the partials can be inferred from peaks in the magnitude spectrum, as computed by an FFT.

Voiced speech is perceived as having a pitch at the fundamental frequency of vocal cord vibration. Perfect periodicity is an idealization, however; the waveforms of voiced speech are non-stationary, quasiperiodic signals. In practice, one cannot reliably extract the partials of voiced speech by simply computing windowed FFTs and locating peaks in the magnitude spectrum.
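For concreteness, the naive approach of windowed FFT followed by peak picking can be sketched in a few lines of NumPy (a toy illustration only; the function name, FFT size, and threshold are our own choices, not part of the system described in this paper):

```python
import numpy as np

def fft_peaks(s, fs, nfft=4096, thresh=0.1):
    """Naive partial extraction: local maxima of the windowed FFT magnitude."""
    mag = np.abs(np.fft.rfft(s[:nfft] * np.hanning(nfft)))
    return [k * fs / nfft for k in range(1, len(mag) - 1)
            if mag[k] > mag[k - 1] and mag[k] > mag[k + 1]
            and mag[k] > thresh * mag.max()]
```

On a clean, stationary harmonic signal this recovers the partials to within a bin width, but on quasiperiodic speech it degrades, which is what motivates the more robust front end reviewed next.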
In this section, we review a more robust method, known as instantaneous frequency (IF) estimation [1], for extracting the stable sinusoidal components of voiced speech. This method is the basis for the signal processing in our front end.

The starting point of IF estimation is to model the voiced speech signal, s(t), by a sum of amplitude and frequency-modulated sinusoids:

    s(t) = \sum_i a_i(t) \cos\left( \int_0^t \omega_i(\tau)\, d\tau + \theta_i \right).    (1)

The arguments of the cosines in eq. (1) are called the instantaneous phases; their derivatives with respect to time yield the so-called instantaneous frequencies ω_i(t). If the amplitudes a_i(t) and frequencies ω_i(t) are stationary, then eq. (1) reduces to a weighted sum of pure sinusoids. For nonstationary signals, ω_i(t) intuitively represents the instantaneous frequency of the ith partial at time t.

The short-time Fourier transform (STFT) provides an efficient tool for IF estimation [2]. The STFT of s(t) with windowing function w(t) is given by:

    F(\omega, t) = \int d\tau\, s(\tau)\, w(\tau - t)\, e^{-j\omega\tau}.    (2)

Let z(ω, t) = e^{jωt} F(ω, t) denote the analytic signal of the Fourier component of s(t) with frequency ω, and let a = Re[z] and b = Im[z] denote its real and imaginary parts.

[Figure 1: Top panel plots instantaneous frequency (Hz, 0-1000) versus time (0-2.0 s); bottom panel plots pitch (Hz, 0-200) versus time.]

Figure 1: Top: instantaneous frequencies of estimated partials for the utterance "The north wind and the sun were disputing." Bottom: f0 contour derived from a laryngograph recording.

We can define a mapping from the time-frequency plane of the STFT to another frequency axis λ(ω, t) by:

    \lambda(\omega, t) = \frac{\partial}{\partial t} \arg[z(\omega, t)] = \frac{a\, \partial b/\partial t - b\, \partial a/\partial t}{a^2 + b^2}.    (3)

The derivatives on the right hand side can be computed efficiently via STFTs [2]. Note that the right hand side of eq.
(3) differentiates the instantaneous phase associated with a particular Fourier component of s(t). IF estimation identifies the stable fixed points [7, 8] of this mapping, given by

    \lambda(\omega^*, t) = \omega^*  \quad\text{and}\quad  \left(\partial\lambda/\partial\omega\right)\big|_{\omega=\omega^*} < 1,    (4)

as the instantaneous frequencies of the partials that appear in eq. (1). Intuitively, these fixed points occur where the notions of energy at frequency ω in eqs. (1) and (2) coincide--that is, where the IF and STFT representations appear most consistent.

The top panel of Fig. 1 shows the IFs of partials extracted by this method for a speech signal with sliding and overlapping analysis windows. The bottom panel shows the pitch contour. Note that in regions of voiced speech, indicated by nonzero f0 values, the IFs exhibit a clear harmonic structure, while in regions of unvoiced speech, they do not.

In summary, the signal processing in our front end extracts partials with frequencies ω_i(t) and nonnegative amplitudes |F(ω_i(t), t)|, where t indexes the time of the analysis window and i indexes the number of extracted partials. Further analysis of the signal is performed by the NMF algorithm described in the next section, which is used to detect the presence of one or more voices and to estimate their f0 values. Similar front ends have been used in other studies of pitch tracking and source separation [1, 2, 7, 13].

3 Nonnegative matrix factorization

For mixed signals of overlapping speakers, our front end outputs the mixture of partials extracted from several voices. How can we analyze this output by NMF? In this section, we show: (i) how to learn nonnegative basis functions that model the characteristic timbres of voiced speech, and (ii) how to decompose mixed signals in terms of these basis functions.

We briefly review NMF [11]. Given observations y_t, the goal of NMF is to compute basis functions W and linear coefficients x_t such that the reconstructed vectors ỹ_t = Wx_t
The observations, basis functions, and coefficients\nare constrained to be nonnegative. Reconstruction errors are measured by the generalized\nKullback-Leibler divergence:\n\n G(y, ~\n y) = [y log(y/~\n y) - y + ~\n y] , (5)\n \n\nwhich is lower bounded by zero and vanishes if and only if y = ~\n y. NMF works by opti-\nmizing the total reconstruction error G(y\n t t, ~\n yt) in terms of the basis functions W and\ncoefficients xt. We form three matrices by concatenating the column vectors yt, ~\n yt and xt\nseparately and denote them by Y, ~\n Y and X respectively. Multiplicative updates for the\noptimization problem are given in terms of the elements of these matrices:\n\n Y W\n Yt/ ~\n Yt\n W t \n W Xt , X . (6)\n ~ t Xt\n Y W\n t t \nThese alternating updates are guaranteed to converge to a local minimum of the total re-\nconstruction error; see [11] for further details.\n\nIn our application of NMF to pitch estimation, the vectors yt store vertical \"time slices\"\nof the IF representation in Fig. 1. Specifically, the elements of yt store the magnitude\nspectra |F ((t), t)|\n i of extracted partials at time t; the instantaneous frequency axis is dis-\ncretized on a log scale so that each element of yt covers 1/36 octave of the frequency spec-\ntrum. The columns of W store basis functions, or harmonic templates, for the magnitude\nspectra of voiced speech with different fundamental frequencies. (An additional column\nin W stores a non-harmonic template for unvoiced speech.) In this study, only one har-\nmonic template was used per fundamental frequency. The fundamental frequencies range\nfrom 50Hz to 400Hz, spaced and discretized on a log scale. We constrained the harmonic\ntemplates for different fundamental frequencies to be related by a simple translation on\nthe log-frequency axis. Tying the columns of W in this way greatly reduces the number\nof parameters that must be estimated by a learning algorithm. 
Finally, the elements of x_t store the coefficients that best reconstruct y_t by linearly superposing harmonic templates of W. Note that only partials from the same source form harmonic relations. Thus, the number of nonzero elements in x_t indicates the number of periodic sources at time t, while the indices of nonzero elements indicate their fundamental frequencies. It is in this sense that the reconstruction y_t ≈ Wx_t provides an analysis of the auditory scene.

3.1 Learning the basis functions of voiced speech

The harmonic templates in W were estimated from the voiced speech of (non-overlapping) speakers in the Keele database [14]. The Keele database provides aligned pitch contours derived from laryngograph recordings. The first halves of all utterances were used for training, while the second halves were reserved for testing. Given the vectors y_t computed by IF estimation in the front end, the problem of NMF is to estimate the columns of W and the reconstruction coefficients x_t. Each x_t has only two nonzero elements (one indicating the reference value for f0, the other corresponding to the non-harmonic template of the basis matrix W); their magnitudes must still be estimated by NMF. The estimation was performed by iterating the updates in eq. (6).

Fig. 2 (left) compares the harmonic template at 100 Hz before and after learning. While the template is initialized with broad spectral peaks, it is considerably sharpened by the NMF learning algorithm. Fig. 2 (right) shows four examples from the Keele database (from snippets of voiced speech with f0 = 100 Hz) that were used to train this template. Note that even among these four partial profiles there is considerable variance.
The learned template is derived to minimize the total reconstruction error over all segments of voiced speech in the training data.

[Figure 2: Left column shows the harmonic template at 100 Hz before (top) and after (bottom) learning; right columns show observed partials for "female: cloak", "male: stronger", "male: travel", and "male: the", plotted over 0-2500 Hz.]

Figure 2: Left: harmonic template before and after learning for voiced speech at f0 = 100 Hz. The learned template (bottom) has a much sharper spectral profile. Right: observed partials from four speakers with f0 = 100 Hz.

3.2 Nonnegative deconvolution for estimating f0 of one or more voices

Once the basis functions in W have been estimated, computing x such that y ≈ Wx under the measure of eq. (5) simplifies to the problem of nonnegative deconvolution. Nonnegative deconvolution has been applied to problems in fundamental frequency estimation [16], music analysis [17] and sound localization [12].

In our model, nonnegative deconvolution of y ≈ Wx yields an estimate of the number of periodic sources in y as well as their f0 values. Ideally, the number of nonzero reconstruction weights in x reveals the number of sources, and the corresponding columns in the basis matrix W reveal their f0 values. In practice, the index of the largest component of x is found, and its corresponding f0 value is deemed to be the dominant fundamental frequency. The second largest component of x is then used to extract a secondary fundamental frequency, and so on. A thresholding heuristic can be used to terminate the search for additional sources.
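With W held fixed, the deconvolution step reduces to iterating only the update for x in eq. (6); a small NumPy sketch (illustrative only; the helper names, iteration count, and the synthetic harmonic stacks in the example below are our own assumptions, not the trained templates of section 3.1):

```python
import numpy as np

def nn_deconvolve(y, W, n_iter=200, eps=1e-9):
    """Infer nonnegative coefficients x such that y ≈ Wx, with W held fixed."""
    x = np.ones(W.shape[1])
    wsum = W.sum(axis=0) + eps
    for _ in range(n_iter):
        x *= (W.T @ (y / (W @ x + eps))) / wsum
    return x

def dominant_f0s(x, f0_grid, n_sources=2):
    """Read off dominant and secondary f0 from the largest weights in x."""
    return [f0_grid[i] for i in np.argsort(x)[::-1][:n_sources]]
```

For example, mixing two comb-like template columns with weights 2 and 1 and deconvolving recovers both weights, so the dominant and secondary f0 values fall out of the sorted coefficients.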
Unvoiced speech is detected by a simple frame-based classifier trained to make voiced/unvoiced distinctions from the observation y and its nonnegative deconvolution x.

The pattern-matching computations in NMF are reminiscent of well-known models of harmonic template matching [6]. Two main differences are worth noting. First, the templates in NMF are learned from labeled speech data. We have found this to be essential in their generalization to unseen cases. It is not obvious how to craft a harmonic template "by hand" that manages the variability of partial profiles in Fig. 2 (right). Second, the template matching in NMF is framed by nonnegativity constraints. Specifically, the algorithm models observed partials by a nonnegative superposition of harmonic stacks. The cost function in eq. (5) also diverges if ỹ = 0 when y is nonzero; this useful property ensures that minima of eq. (5) must explain each observed partial by its attribution to one or more sources. This property does not hold for traditional least-squares linear reconstructions.

4 Implementation and results

We have implemented both the IF estimation in section 2 and the nonnegative deconvolution in section 3.2 in a real-time system for pitch tracking. The software runs on a laptop computer with a visual display that shows the contour of estimated f0 values scrolling in real-time. After the signal is downsampled to 4900 Hz, IF estimation is performed in 10 ms shifts with an analysis window of 50 ms. Partials extracted from the fixed points of eq. (4) are discretized on a log-frequency axis.
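The discretization of extracted partials onto a log-frequency grid can be sketched as follows (our own illustrative code: the 1/36-octave spacing mirrors the text, but the function, its arguments, and the choice to accumulate magnitudes into bins are assumptions):

```python
import numpy as np

def log_freq_slice(freqs, mags, f_min=50.0, n_octaves=6, bins_per_octave=36):
    """Accumulate partial magnitudes into a log-frequency vector y_t,
    each bin covering 1/36 octave above f_min."""
    y = np.zeros(n_octaves * bins_per_octave)
    for f, m in zip(freqs, mags):
        if f >= f_min:
            k = int(round(bins_per_octave * np.log2(f / f_min)))
            if k < y.size:
                y[k] += m  # nearest log-frequency bin
    return y
```

Octave-spaced partials then land exactly 36 bins apart, which is what makes the translated (tied) harmonic templates of section 3 a simple shift on this axis.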
The columns of the basis matrix W provide harmonic templates for f0 = 50 Hz to f0 = 400 Hz with a step size of 1/36 octave. To achieve real-time performance and reduce system latency, the system does not postprocess the f0 values obtained in each frame from nonnegative deconvolution: in particular, there is no dynamic programming to smooth the pitch contour, as commonly done in many pitch tracking algorithms [18]. We have found that our algorithm performs well and yields smooth pitch contours (for non-overlapping voices) even without this postprocessing.

           Keele database
           VE (%)  UE (%)  GPE (%)  RMS (Hz)
    NMF     7.7     4.6     0.9      4.3
    RAPT    3.2     6.8     2.2      4.4

           Edinburgh database
           VE (%)  UE (%)  GPE (%)  RMS (Hz)
    NMF     7.8     4.4     0.7      5.8
    RAPT    4.5     8.4     1.9      5.3

Table 1: Comparison between our algorithm and RAPT [18] on the test portion of the Keele database (see text) and the full Edinburgh database, in terms of the percentages of voiced errors (VE), unvoiced errors (UE), and gross pitch errors (GPE), as well as the root mean square (RMS) deviation in Hz.

4.1 Pitch determination of clean speech signals

Table 1 compares the performance of our algorithm on clean speech to RAPT [18], a state-of-the-art pitch tracker based on autocorrelation and dynamic programming. Four error types are reported: the percentage of voiced frames misclassified as unvoiced (VE), the percentage of unvoiced frames misclassified as voiced (UE), the percentage of voiced frames with gross pitch errors (GPE) where predicted and reference f0 values differ by more than 20%, and the root-mean-squared (RMS) difference between predicted and reference f0 values when there are no gross pitch errors. The results were obtained on the second halves of utterances reserved for testing in the Keele database, as well as the full set of utterances in the Edinburgh database [3].
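The four error measures can be computed per frame as in the following sketch (under our own conventions, which the paper does not spell out: f0 = 0 marks an unvoiced frame, VE and UE are normalized by the number of voiced and unvoiced reference frames respectively, and GPE/RMS are taken over frames voiced in both reference and estimate):

```python
import numpy as np

def pitch_errors(f0_ref, f0_est, gross_tol=0.2):
    """Return VE, UE, GPE (percent) and RMS (Hz) for frame-wise pitch tracks."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_est = np.asarray(f0_est, dtype=float)
    ref_v, est_v = f0_ref > 0, f0_est > 0
    ve = 100.0 * np.sum(ref_v & ~est_v) / max(np.sum(ref_v), 1)   # voiced -> unvoiced
    ue = 100.0 * np.sum(~ref_v & est_v) / max(np.sum(~ref_v), 1)  # unvoiced -> voiced
    both = ref_v & est_v
    rel = np.abs(f0_est[both] - f0_ref[both]) / f0_ref[both]
    gpe = 100.0 * np.mean(rel > gross_tol) if both.any() else 0.0
    fine = f0_est[both][rel <= gross_tol] - f0_ref[both][rel <= gross_tol]
    rms = float(np.sqrt(np.mean(fine ** 2))) if fine.size else 0.0
    return ve, ue, gpe, rms
```

Excluding gross errors from the RMS, as done here, keeps the fine-accuracy measure from being dominated by octave-type mistakes already counted in GPE.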
As shown in the table, the performance of our algorithm is comparable to that of RAPT.

4.2 Pitch determination of overlapping voices and noisy speech

We have also examined the robustness of our system to noise and overlapping speakers. Fig. 3 shows the f0 values estimated by our algorithm from a mixture of two voices--one with ascending pitch, the other with descending pitch. Each voice spans one octave. The dominant and secondary f0 values extracted in each frame by nonnegative deconvolution are shown. The algorithm recovers the f0 values of the individual voices almost perfectly, though it does not currently make any effort to track the voices through time. (This is a subject for future work.)

Fig. 4 shows in more detail how IF estimation and nonnegative deconvolution are affected by interfering speakers and noise. A clean signal from a single speaker is shown in the top row of the plot, along with its log power spectra, partials extracted by IF estimation, estimated f0, and reconstructed harmonic stack. The second and third rows show the effects of adding white noise and an overlapping speaker, respectively. Both types of interference degrade the harmonic structure in the log power spectra and extracted partials. However, nonnegative deconvolution is still able to recover the pitch of the original speaker, as well as the pitch of the second speaker. On larger evaluations of the algorithm's robustness, we have obtained results comparable to RAPT over a wide range of SNRs (as low as 0 dB).

[Figure 3: Left panel plots frequency (Hz, 0-1000) versus time (0-3 s); right panel plots dominant and secondary F0 (Hz, 50-200) versus time (0-3 s).]

Figure 3: Left: Spectrogram of a mixture of two voices with ascending and descending f0 contours.
Right: f0 values estimated by NMF.

[Figure 4: Rows show a clean signal, the same signal with white noise added, and a mix of two signals; columns show the waveform, log power spectra, extracted partials Y, deconvoluted X, and reconstructed Y.]

Figure 4: Effect of white noise (middle row) and overlapping speaker (bottom row) on clean speech (top row). Both types of interference degrade the harmonic structure in the log power spectra (second column) and the partials extracted by IF estimation (third column). The results of nonnegative deconvolution (fourth column), however, are fairly robust. Both the pitch of the original speaker at f0 = 200 Hz and the overlapping speaker at f0 = 300 Hz are recovered. The fifth column displays the reconstructed profile of extracted partials from activated harmonic templates.

5 Discussion

There exists a large body of related work on fundamental frequency estimation of overlapping sources [5, 7, 9, 19, 20, 21]. Our contributions in this paper are to develop a new framework based on recent advances in unsupervised learning and to study the problem with the constraints imposed by real-time system building. Nonnegative deconvolution is similar to EM algorithms [7] for harmonic template matching, but it does not impose normalization constraints on spectral peaks as if they represented a probability distribution. Important directions for future work are to train a richer set of harmonic templates by NMF, to incorporate the frame-based computations of nonnegative deconvolution into a dynamical model, and to embed our real-time system in interactive agents that respond to the pitch contours of human speech. All these directions are being actively pursued.

References

[1] T. Abe, T. Kobayashi, and S. Imai. Harmonics tracking and pitch extraction based on instantaneous frequency. In Proc.
of ICASSP, pages 756-759, 1995.

[2] T. Abe, T. Kobayashi, and S. Imai. Robust pitch estimation with harmonics enhancement in noisy environments based on instantaneous frequency. In Proc. of ICSLP, pages 1277-1280, 1996.

[3] P. Bagshaw, S. M. Hiller, and M. A. Jack. Enhanced pitch tracking and the processing of f0 contours for computer aided intonation teaching. In Proc. of 3rd European Conference on Speech Communication and Technology, pages 1003-1006, 1993.

[4] A. S. Bregman. Auditory Scene Analysis: The Perceptual Organization of Sound. MIT Press, 2nd edition, 1999.

[5] A. de Cheveigne and H. Kawahara. Multiple period estimation and pitch perception model. Speech Communication, 27:175-185, 1999.

[6] J. Goldstein. An optimum processor theory for the central formation of the pitch of complex tones. J. Acoust. Soc. Am., 54:1496-1516, 1973.

[7] M. Goto. A robust predominant-F0 estimation method for real-time detection of melody and bass lines in CD recordings. In Proc. of ICASSP, pages 757-760, June 2000.

[8] H. Kawahara, H. Katayose, A. de Cheveigne, and R. D. Patterson. Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of f0 and periodicity. In Proc. of EuroSpeech, pages 2781-2784, 1999.

[9] A. Klapuri, T. Virtanen, and J.-M. Holm. Robust multipitch estimation for the analysis and manipulation of polyphonic musical signals. In Proc. of COST-G6 Conference on Digital Audio Effects, Verona, Italy, 2000.

[10] D. D. Lee and H. S. Seung. Learning in intelligent embedded systems. In Proc. of USENIX Workshop on Embedded Systems, 1999.

[11] D. D. Lee and H. S. Seung. Learning the parts of objects with nonnegative matrix factorization. Nature, 401:788-791, 1999.

[12] Y. Lin, D. D. Lee, and L. K. Saul. Nonnegative deconvolution for time of arrival estimation. In Proc. of ICASSP, 2004.

[13] T. Nakatani and T. Irino.
Robust fundamental frequency estimation against background noise and spectral distortion. In Proc. of ICSLP, pages 1733-1736, 2002.

[14] F. Plante, G. F. Meyer, and W. A. Ainsworth. A pitch extraction reference database. In Proc. of EuroSpeech, pages 837-840, 1995.

[15] L. K. Saul, D. D. Lee, C. L. Isbell, and Y. LeCun. Real time voice processing with audiovisual feedback: toward autonomous agents with perfect pitch. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15. MIT Press, 2003.

[16] L. K. Saul, F. Sha, and D. D. Lee. Statistical signal processing with nonnegativity constraints. In Proc. of EuroSpeech, pages 1001-1004, 2003.

[17] P. Smaragdis and J. C. Brown. Non-negative matrix factorization for polyphonic music transcription. In Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 177-180, 2003.

[18] D. Talkin. A robust algorithm for pitch tracking (RAPT). In W. B. Kleijn and K. K. Paliwal, editors, Speech Coding and Synthesis, chapter 14. Elsevier Science B.V., 1995.

[19] T. Tolonen and M. Karjalainen. A computationally efficient multipitch analysis model. IEEE Trans. on Speech and Audio Processing, 8(6):708-716, 2000.

[20] T. Virtanen and A. Klapuri. Separation of harmonic sounds using multipitch analysis and iterative parameter estimation. In Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 83-86, New Paltz, NY, USA, Oct 2001.

[21] M. Wu, D. Wang, and G. J. Brown. A multipitch tracking algorithm for noisy speech. IEEE Trans. on Speech and Audio Processing, 11:229-241, 2003.
", "award": [], "sourceid": 2631, "authors": [{"given_name": "Fei", "family_name": "Sha", "institution": null}, {"given_name": "Lawrence", "family_name": "Saul", "institution": null}]}