{"title": "JANUS: Speech-to-Speech Translation Using Connectionist and Non-Connectionist Techniques", "book": "Advances in Neural Information Processing Systems", "page_first": 183, "page_last": 190, "abstract": null, "full_text": "JANUS: Speech-to-Speech Translation Using \n\nConnectionist and Non-Connectionist Techniques \n\nAlex Waibel*  Ajay N. Jain\u2020 \nArthur McNair  Joe Tebelskis \n\nSchool of Computer Science \nCarnegie Mellon University \n\nPittsburgh, PA 15213 \n\nLouise Osterholtz \n\nComputational Linguistics Program \n\nCarnegie Mellon University \n\nHiroaki Saito \nKeio University \nTokyo, Japan \n\nOtto Schmidbauer \nSiemens Corporation \nMunich, Germany \n\nTilo Sloboda  Monika Woszczyna \n\nUniversity of Karlsruhe \n\nKarlsruhe, Germany \n\nABSTRACT \n\nWe present JANUS, a speech-to-speech translation system that utilizes \ndiverse processing strategies, including connectionist learning, traditional \nAI knowledge representation approaches, dynamic programming, \nand stochastic techniques. JANUS translates continuously spoken \nEnglish and German into German, English, and Japanese. JANUS currently \nachieves 87% translation fidelity from English speech and 97% \nfrom German speech. We present the JANUS system along with comparative \nevaluations of its interchangeable processing components, with \nspecial emphasis on the connectionist modules. \n\n* Also with University of Karlsruhe, Karlsruhe, Germany. \n\u2020 Now with Alliant Techsystems Research and Technology Center, Hopkins, Minnesota. \n\n1  INTRODUCTION \n\nIn an age of increasing globalization of our economies and ever more efficient communication \nmedia, one important challenge is the need for effective ways of overcoming language \nbarriers. Human translation efforts are generally expensive and slow, 
thus \neliminating this possibility between individuals and around rapidly changing material (e.g. \nnewscasts, newspapers). This need has recently led to a resurgence of effort in machine \ntranslation, mostly of written language. \n\nMuch of human communication, however, is spoken, and the problem of spoken language \ntranslation must also be addressed. If successful, speech-to-text translation systems could \nlead to automatic subtitles in TV broadcasts and cross-linguistic dictation. Speech-to-speech \ntranslation could be deployed as an interpreting telephone service in restricted domains \nsuch as cross-linguistic hotel/conference reservations, catalog purchasing, travel planning, \netc., and eventually in general domains, such as person-to-person telephone calls. Apart \nfrom telephone service, speech translation could facilitate multilingual negotiations and \ncollaboration in face-to-face or video-conferencing settings. \n\nWith the potential applications so promising, what are the scientific challenges? Speech \ntranslation systems will need to address three distinct problems: \n\n\u2022 Speech Recognition and Understanding: A naturally spoken utterance must be recognized \nand understood in the context of ongoing dialog. \n\n\u2022 Machine Translation: A recognized message must be translated from one language \ninto another (or several others). \n\n\u2022 Speech Synthesis: A translated message must be synthesized in the target language. \n\nConsiderable challenges still face the development of each of the components, let alone the \ncombination of the three. 
Among them, only speech synthesis is mature enough for commercial \nsystems to exist that can synthesize intelligible speech in several languages from \ntext. But even here, to guarantee acceptance of the translation system, research is needed to \nimprove naturalness and to allow for adaptation of the output speech (in the target language) \nto the voice characteristics of the input speaker. Speech recognition systems to date \nare generally limited in vocabulary size, and can only accept grammatically well-formed \nutterances. They require improvement to handle spontaneous unrestricted dialogs. Machine \nTranslation systems require considerable development effort to work in a given language \npair and domain reasonably well, and generally require syntactically well-formed input \nsentences. Improvements are needed to handle ill-formed sentences well and to allow for \nflexibility in the face of changes in domain and language pairs. \n\nBeyond the challenges facing each system component, the combination of the three also \nintroduces extra difficulties. Both the speech recognition and machine translation components \nmust deal with spoken language: ill-formed noisy input, both acoustically as well \nas syntactically. Therefore, the speech recognition component must be concerned less with \ntranscription fidelity than semantic fidelity, while the MT component must try to capture \nthe meaning or intent of the input sentence without being guaranteed a syntactically legal \nsequence of words. In addition, non-symbolic prosodic information (intonation, rhythm, \netc.) and dialog state must be taken into consideration to properly translate an input utterance. \nA closer cooperation between traditional signal processing and language level processing \nmust be achieved. 
\n\n[Figure 1 diagram: Input Utterance; Speech System; PARSEC Network or LR Parser; DecTalk DTC01; Translated Utterance] \n\nFigure 1: High-level JANUS architecture \n\nJANUS is our first attempt at multilingual speech translation. It is the result of a collaborative \neffort between ATR Interpreting Telephony Research Laboratories, Carnegie Mellon \nUniversity, Siemens Corporation, and the University of Karlsruhe. JANUS currently \naccepts continuously spoken sentences from a conference registration scenario, where a fictitious \ncaller attempts to register for an international conference. The dialogs are read aloud \nfrom dialog scripts that make use of a vocabulary of approximately 400 words. Speaker-dependent \nand independent versions of the input recognition systems have been developed. \nJANUS currently accepts continuously spoken English and German input and produces \nspoken German, English, and Japanese output as a result. \n\nWhile JANUS has some of the limitations mentioned above, it is the first tri-lingual continuous \nlarge vocabulary speech translation system to date. It is a vehicle toward overcoming \nsome of the limitations described. A particular focus is the trainability of system components, \nso that flexible, adaptive, and robust systems may result. JANUS is a hybrid system \nthat uses a blend of computational strategies: connectionist, statistical and knowledge \nbased techniques. This paper will describe each of JANUS's processing components separately \nand particularly highlight the relative contributions of connectionist techniques \nwithin this ensemble. Figure 1 shows a high-level diagram of JANUS's components. 
\n\n2  SPEECH RECOGNITION \n\nTwo alternative speech recognition systems are currently used in JANUS: Linked Predictive \nNeural Networks (LPNNs) and Learned Vector Quantization networks (LVQ) (Tebelskis \net al. 1991; Schmidbauer and Tebelskis 1992). They are both connectionist, \ncontinuous-speech recognition systems, and both have vocabularies of approximately 400 \nEnglish and 400 German words. Each uses statistical bigram or word-pair grammars \nderived from the conference registration database. The systems are based on canonical \nphoneme models (states) that can be logically concatenated in any order to create models \nfor different words. The need for training data with labeled phonemes can be reduced by \nfirst bootstrapping the networks on a small amount of speech with forced phoneme boundaries, \nthen training on the whole database using only forced word boundaries. \n\nIn the LPNN system, each phoneme model is implemented by a predictive neural network. \nEach network is trained to accurately predict the next frame of speech within segments of \nspeech corresponding to its phoneme model. Continuous scores (prediction errors) are \naccumulated for various word candidates. The LPNN module produces either a single \nhypothesized sentence or the first N best hypotheses using a modified dynamic-programming \nbeam-search algorithm (Steinbiss 1989). The LPNN system has speaker-dependent \nword accuracy rates of 93% with first-best recognition, and sentence accuracy of 69%. \n\nLVQ is a vector clustering technique based on neural networks. We have used LVQ to \nautomatically cluster speech frames into a set of acoustic features; these features are fed \ninto a set of output units that compute the emission probability for HMM states. 
This technique \ngives speaker-dependent word accuracy rates of 98%, 86%, and 82% for English \nconference registration tasks of perplexity 7, 61, and 111, respectively. The sentence recognition \nrate at perplexity 7 is 80%. \n\nWe are also evaluating other approaches to speech recognition, such as the Multi-State \nTDNN for continuous-speech (Haffner, Franzini, and Waibel 1991) and a neural-network \nbased word spotting system that may be useful for modeling spontaneous speech effects \n(Zeppenfield and Waibel 1992). The recognition systems' text output serves as input to \nthe alternative parsing modules of JANUS. \n\n3  LANGUAGE UNDERSTANDING AND TRANSLATION \n\n3.1  LANGUAGE ANALYSIS \n\nThe translation module of JANUS is based on the Universal Parser Architecture (UPA) \ndeveloped at Carnegie Mellon (Tomita and Carbonell 1987; Tomita and Nyberg 1988). It \nis designed for efficient multi-lingual translation. Text in a source language is parsed into a \nlanguage independent frame-based interlingual representation. From the interlingua, text \ncan be generated in different languages. \n\nThe system requires hand-written parsing and generation grammars for each language to \nbe processed. The parsing grammars are based on a Lexical Functional Grammar formalism, \nand are implemented using Tomita's Generalized LR parsing Algorithm (Tomita \n1991). The generation grammars are compiled into LISP functions. Both parsing and generation \nwith UPA approach real-time. Figure 2 shows an example of the input, interlingual \nrepresentation, and the output of the JANUS system. \n\n3.2  PARSEC: CONNECTIONIST PARSING \n\nJANUS can use a connectionist parser in place of the LR parser to process the output of \nthe speech system. PARSEC is a structured connectionist parsing architecture that is \ngeared toward the problems found in spoken language (for details, see Jain 1992 (in this \nvolume) and Jain's PhD thesis, in preparation). 
PARSEC networks exhibit three strengths: \n\n\u2022 They automatically learn to parse, and generalize well compared to hand-coded \ngrammars. \n\n\u2022 They tolerate several types of noise without any explicit noise modeling. \n\n\u2022 They can learn to use multi-modal input such as pitch in conjunction with syntax and \nsemantics. \n\nThe PARSEC network architecture relies on a variation of supervised back-propagation \nlearning. The architecture differs from some other connectionist approaches in that it is \nhighly structured, both at the macroscopic level of modules, and at the microscopic level \nof connections. \n\nInput \n\nHello is this the office for the conference. \n\nInterlingual Representation \n\n((CFNAME *is-this-phone) \n (MOOD *interrogative) \n (OBJECT \n  ((NUMBER sg) \n   (DET the) \n   (CFNAME *conf-office))) \n (SADJUNCT1 \n  ((CFNAME *hello)))) \n\nOutput \n\nJapanese: MOSHI MOSHI KAIGI JIMUKYOKU DESUKA \nGerman: HALLO IST DIES DAS KONFERENZBUERO \n\nFigure 2: Example of input, interlingua, and output of JANUS \n\n3.2.1  Learning and Generalization \n\nThrough exposure to example output parses, PARSEC networks learn parsing behavior. \nTrained networks generalize well compared to hand-written grammars. In direct tests of \ncoverage for the conference registration domain, PARSEC achieved 67% correct parsing \nof novel sentences, whereas hand-written grammars achieved just 5%, 25%, and 38% correct. \nTwo of the grammars were written as part of a contest with a large cash prize for best \ncoverage. \n\nThe process of training PARSEC networks is highly automated, and is made possible \nthrough the use of constructive learning coupled with a robust control procedure that \ndynamically adjusts learning parameters during training. 
Novice users of the PARSEC \nsystem were able to train networks for parsing a German-language version of the conference \nregistration task and a novel English air-travel reservation task. \n\n3.2.2  Noise Tolerance \n\nWe have compared PARSEC's performance on noisy input with that of hand-written \ngrammars. On synthetic ungrammatical conference registration sentences, PARSEC produced \nacceptable interpretations 66% of the time, with the three hand-coded grammars \nmentioned above performing at 2%, 38%, and 34%, respectively. We have also evaluated \nPARSEC in the context of noisy speech recognition in JANUS, and this is discussed later. \n\n3.2.3  Multi-Modal Input \n\nA somewhat elusive goal of spoken language processing has been to utilize information \nfrom the speech signal beyond just word sequences in higher-level processing. It is well \nknown that humans use such information extensively in conversation. Consider the utterances \n\"Okay.\" and \"Okay?\" Although semantically distinct, they cannot be distinguished \nbased on word sequence, but pitch contours contain the necessary information (Figure 3). \n\n[Figure 3 plots: pitch contour for \"Okay.\" (duration = 409.1 msec, mean freq = 113.2) and pitch contour for \"Okay?\" (duration = 377.0 msec, mean freq = 137.3)] \n\nFigure 3: Smoothed pitch contours. 
\n\nIn a grammar-based system, it is difficult to incorporate real-valued vector input in a useful \nway. In a PARSEC network, the vector is just another set of input units. A module of a \nPARSEC network was augmented to contain an additional set of units that contained pitch \ninformation. The pitch contours were smoothed output from the OGI Neural Network \nPitch Tracker (Barnard et al. 1991). \n\nWithin the JANUS system, the augmented PARSEC network brings new functionality. \nIntonation affects translation in JANUS when using the augmented PARSEC network. \nThe sentence, \"This is the conference office.\" is translated to \"Kaigi jimukyoku desu.\" \n\"This is the conference office?\" is translated to \"Kaigi jimukyoku desuka?\" This required \nno changes in the other modules of the JANUS system. It also should be possible to use \nother types of information from the speech signal to aid in robust parsing (e.g. energy patterns \nto disambiguate clausal structure). \n\n4  SPEECH SYNTHESIS \n\nTo generate intelligible speech in the respective target languages, we have predominantly \nused commercial devices. Most notably, DEC-talk has provided unrestricted English text-to-speech \nsynthesis. DEC-talk has also been used for Japanese and German synthesis. The \ninternal English text-to-phoneme conversion rules and tables of DEC-talk were bypassed \nby external German and Japanese text-to-phoneme look-up tables that convert the German/Japanese \ntarget sentences into phonemic strings for DEC-talk synthesis. The resulting \nsynthesis is limited to the task vocabulary, but the external tables result in intelligible \nGerman and Japanese speech, albeit with a pronounced American accent. \n\nTo allow for greater flexibility in vocabulary and more language specific synthesis, several \nalternate devices are currently being integrated. 
For Japanese, in particular, two high quality \nspeech synthesizers developed separately by NEC and ATR will be used to provide \nmore satisfactory results. In JANUS, no attempt has so far been made to adapt the output \nspeech to the input speaker's voice characteristics. However, such adaptation has recently been demonstrated \nby work with code book mapping (Abe, Shikano, and Kuwabara 1990) and connectionist \nmapping techniques (Huang, Lee, and Waibel 1991). \n\n5  IMPLEMENTATION ISSUES AND PERFORMANCE \n\n5.1  Parallel Hardware \n\nNeural network forward passes for the speech recognizer were programmed on two general \npurpose parallel machines, a MasPar computer at the University of Karlsruhe, Germany, \nand an Intel iWarp at Carnegie Mellon. The MasPar is a parallel SIMD machine \nwith 4096 processing elements. The iWarp is a MIMD machine, and a 16MHz, 64 cell \nexperimental version was used for testing. \n\nThe use of parallel hardware and algorithms has significantly decreased JANUS's processing \ntime. Compared to forward pass calculations performed by a DecStation 5000, the \niWarp is 9 times faster (41.4 million connections per second). The MasPar does the forward \npass calculations for a two second utterance in less than 500 milliseconds. Both the \niWarp and MasPar are scalable. Efforts are underway to implement other parts of JANUS \non parallel hardware with the goal of near real-time performance. \n\n5.2  Performance \n\nCurrently, English JANUS using the LR parsing module (JANUS-LR) performs at 87% \ncorrect translation using the LPNN speech system with the N-best sentence hypotheses. \nGerman JANUS performs at 97% correct translation (on a subset of the conference registration \ndatabase) using German versions of the LPNN system and LR parsing grammar. 
\n\nEnglish JANUS using PARSEC (JANUS-NN) does not perform as well as the LR parser \nversion in N-best mode, with 80% correct translation. PARSEC is not able to select from a \nlist of ranked candidate utterance hypotheses as robustly as is the LR parser using a very \ntight grammar. However, the grammar used for this comparison only achieves 5% coverage \nof novel test sentences, compared with PARSEC's 67%. This vast difference in coverage \nexplains some of the N-best performance difference. \n\nIn First-best mode, however, JANUS-NN does better than JANUS-LR (77% versus 70%). \nThe PARSEC network is able to produce acceptable parses for a number of noisy speech \nrecognition hypotheses, but JANUS-LR tends to reject those hypotheses as unparsable. \nPARSEC's flexibility, which hurt its N-best performance, enhances its First-best performance. \nNo performance evaluations were carried out using German PARSEC in German \nJANUS. \n\n6  CONCLUSION \n\nIn this paper we have described JANUS, a multi-lingual speech-to-speech translation system. \nJANUS uses a mixture of connectionist, statistical and rule based strategies to achieve \nthis goal. Connectionist models have contributed high-performance recognition \nand parsing as well as greater robustness in the light of task variations and \nsyntactically ill-formed sentences. Connectionist models also provide a mechanism for \nmerging traditionally distinct symbolic (syntax) and signal-level (intonation) information \ngracefully and achieve successful disambiguation between grammatical statements whose \nmood can be affected by intonation. Finally, connectionist sentence analysis appears to \noffer high flexibility as the relevant modules can be retrained automatically for new tasks, \ndomains and even languages without laborious recoding. 
We plan to continue exploring \ndifferent mixtures of computing paradigms to achieve higher performance. \n\nAcknowledgements \n\nThe authors gratefully acknowledge the support of ATR Interpreting Telephony Laboratories, \nSiemens Corporation, NEC Corporation, and the National Science Foundation. \n\nReferences \n\nAbe, M., K. Shikano, H. Kuwabara. 1990. Cross Language Voice Conversion. In IEEE \nProceedings of the International Conference on Acoustics, Speech, and Signal Processing. \n\nBarnard, E., R. A. Cole, M. P. Vea, F. A. Alleva. 1991. Pitch Detection with a Neural-Net \nClassifier. IEEE Transactions on Signal Processing 39(2): 298-307. \n\nHaffner, P., M. Franzini, and A. Waibel. 1991. Integrating time alignment and neural networks \nfor high performance speech recognition. In IEEE Proceedings of the International \nConference on Acoustics, Speech, and Signal Processing. \n\nHuang, X. D., K. F. Lee, A. Waibel. 1991. In Proceedings of the IEEE-SP Workshop on \nNeural Networks for Signal Processing. \n\nJain, A. N. 1992. Generalization performance in PARSEC: A structured connectionist \nlearning architecture. In Advances in Neural Information Processing Systems 4, ed. J. \nE. Moody, S. J. Hanson, and R. P. Lippmann. San Mateo, CA: Morgan Kaufmann Publishers. \n\nJain, A. N. In preparation. PARSEC: A Connectionist Learning Architecture for Parsing \nSpoken Language. PhD Thesis, School of Computer Science, Carnegie Mellon University. \n\nSchmidbauer, O. and J. Tebelskis. 1992. An LVQ based reference model for speaker-adaptive \nspeech recognition. In IEEE Proceedings of the International Conference on \nAcoustics, Speech, and Signal Processing. \n\nSteinbiss, V. 1989. Sentence-hypothesis generation in a continuous-speech recognition \nsystem. 
In Proceedings of the 1989 European Conference on Speech Communication \nand Technology, Vol. 2, 51-54. \n\nTebelskis, J., A. Waibel, B. Petek, O. Schmidbauer. 1991. Continuous speech recognition \nby Linked Predictive Neural Networks. In Advances in Neural Information Processing \nSystems 3, ed. R. Lippmann, J. Moody, and D. Touretzky. San Mateo, CA: Morgan \nKaufmann Publishers. \n\nTomita, M. (ed.). 1991. Generalized LR Parsing. Norwell, MA: Kluwer Academic Publishers. \n\nTomita, M. and J. G. Carbonell. 1987. The Universal Parser Architecture for Knowledge-Based \nMachine Translation. Technical Report CMU-CMT-87-01, Center for Machine \nTranslation, Carnegie Mellon University. \n\nTomita, M. and E. Nyberg. 1988. Generation Kit and Transformation Kit. Technical \nReport CMU-CMT-88-MEMO, Center for Machine Translation, Carnegie Mellon University. \n\nZeppenfield, T. and A. Waibel. 1992. A hybrid neural network, dynamic programming \nword spotter. In IEEE Proceedings of the International Conference on Acoustics, \nSpeech, and Signal Processing. \n", "award": [], "sourceid": 480, "authors": [{"given_name": "Alex", "family_name": "Waibel", "institution": null}, {"given_name": "Ajay", "family_name": "Jain", "institution": null}, {"given_name": "Arthur", "family_name": "McNair", "institution": null}, {"given_name": "Joe", "family_name": "Tebelskis", "institution": null}, {"given_name": "Louise", "family_name": "Osterholtz", "institution": null}, {"given_name": "Hiroaki", "family_name": "Saito", "institution": null}, {"given_name": "Otto", "family_name": "Schmidbauer", "institution": null}, {"given_name": "Tilo", "family_name": "Sloboda", "institution": null}, {"given_name": "Monika", "family_name": "Woszczyna", "institution": null}]}