Tutorial 8: Speech Translation: Theory and Practice

Monday, May 27, 9 am-12 noon

Presented by

Bowen Zhou, Xiaodong He


In this tutorial, we first survey the latest statistical machine translation (SMT) technologies, presenting both the theoretical and practical perspectives that are most relevant to speech translation (ST). Next, we review key learning problems and investigate essential model structures in SMT, taking a unified perspective to reveal both the connections and contrasts between automatic speech recognition (ASR) and SMT. Although they address two important yet different problems, we view SMT and ASR as closely related (with some exceptions) in their modeling and optimization techniques. For example, using formalisms familiar to the speech community, we show that phrase-based SMT can be viewed as a sequence of finite-state transducer (FST) operations, similar in spirit to ASR. We further examine the synchronous context-free grammar (SCFG) based formalism, which includes hierarchical phrase-based models and many linguistically motivated syntax-based models. Decoding for ASR, FST-based translation, and SCFG-based translation is also presented from a unified perspective, as different realizations of the generic Viterbi algorithm on graphs or hypergraphs. These consolidated perspectives help catalyze tighter integration for improved ST and the sharing of optimization techniques in discriminative training. In particular, this unified perspective enables us to present an end-to-end framework for speech translation, including joint modeling of the speech translation problem in a log-linear framework with feature functions derived from ASR and SMT components, joint decoding of ASR and MT through a WFST/confusion network, and joint training of ASR and MT models by optimizing an end-to-end speech translation criterion.
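
As a rough sketch of what such a log-linear formulation can look like, the decision rule below uses notation that is an illustrative assumption and is not taken from the tutorial materials: X denotes the input speech, F a source-language transcription hypothesis, E a target-language translation, h_i a feature function, and \lambda_i its weight. The choice of maximizing over F (rather than summing over it) is one common approximation.

```latex
\hat{E} = \arg\max_{E} \max_{F} P(E, F \mid X),
\qquad
P(E, F \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{i} \lambda_i \, h_i(E, F, X) \Big),
\qquad
Z(X) = \sum_{E', F'} \exp\Big( \sum_{i} \lambda_i \, h_i(E', F', X) \Big)
```

Because Z(X) does not depend on E or F, decoding only needs the weighted sum \sum_i \lambda_i h_i(E, F, X). Typical choices for h_i would be log-scores from the ASR acoustic and language models and from the SMT translation and target language models, although the exact feature set is left open here.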

Beyond providing a systematic tutorial on the general theory of speech translation, we also share hands-on experience in building state-of-the-art speech translation/spoken language translation (ST/SLT) systems. We illustrate our practice with concrete examples drawn from our work on major ST/SLT research projects and evaluations such as DARPA TRANSTAC and GALE/BOLT, NIST OpenMT, and IWSLT, on which we have worked extensively and for which we have developed highly competitive entries.

1. Overview of speech translation
   i. Major speech translation research projects
   ii. High-level architecture for speech translation
   iii. Automatic speech recognition
   iv. Statistical machine translation
   v. Evaluation metrics
2. Learning of translation models
   i. Word alignment model
   ii. From words to phrases
   iii. Parameterization of SMT
   iv. Discriminative training of translation models
3. Translation structures for ST
   i. FST-based translation equivalence
   ii. SCFG-based translation equivalence
   iii. Learning of SCFG-based models
4. Unified translation decoding
   i. Graph, hypergraph, and generalized Viterbi decoding
   ii. Viterbi decoding for FST-based MT
   iii. Viterbi decoding for SCFG models
5. End-to-End optimization of ST
   i. End-to-End joint modeling of ST
   ii. Joint decoding of ASR and MT in ST
   iii. Joint training of ASR and MT with end-to-end ST criteria
6. Summary and discussion
   i. Ongoing and future advances in ASR and SMT
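
As an illustration of outline item 4.i, the following is a minimal Python sketch of Viterbi decoding generalized from graphs to hypergraphs. It is not the presenters' implementation: the `Hyperedge` class, the node names, the rule labels, and the weights are all hypothetical, and the nodes are assumed to be supplied in topological order (every tail before its head). When every hyperedge has exactly one tail, the recursion reduces to ordinary Viterbi search on a graph, as used for ASR and FST-based translation; hyperedges with several tails correspond to SCFG rules that combine sub-derivations.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Hyperedge:
    head: str               # node produced by this hyperedge
    tails: Tuple[str, ...]  # antecedent nodes; a single tail is an ordinary graph arc
    log_weight: float       # log-score of applying this edge
    label: str              # hypothetical rule/arc name, for readability only


def viterbi(order: Sequence[str], edges: Sequence[Hyperedge]):
    """Return the best log-score and back-pointer for every node.

    `order` must list the nodes so that every tail precedes its head.
    """
    incoming: Dict[str, List[Hyperedge]] = {}
    for edge in edges:
        incoming.setdefault(edge.head, []).append(edge)

    best: Dict[str, float] = {}
    back: Dict[str, Optional[Hyperedge]] = {}
    for node in order:
        best[node], back[node] = -math.inf, None
        for edge in incoming.get(node, []):
            # Edge score = its own weight plus the best scores of all its tails.
            score = edge.log_weight + sum(best[t] for t in edge.tails)
            if score > best[node]:
                best[node], back[node] = score, edge
    return best, back


def derivation(node: str, back: Dict[str, Optional[Hyperedge]]) -> List[str]:
    """Follow back-pointers to list the labels of the best derivation (post-order)."""
    edge = back[node]
    if edge is None:
        return []
    labels: List[str] = []
    for tail in edge.tails:
        labels.extend(derivation(tail, back))
    labels.append(edge.label)
    return labels


# Toy hypergraph: two leaf nodes (edges with no tails) and a goal node S
# that can be built by combining them in monotone or swapped order.
EDGES = [
    Hyperedge("X1", (), math.log(0.5), "x1:a"),
    Hyperedge("X1", (), math.log(0.4), "x1:b"),
    Hyperedge("X2", (), math.log(0.9), "x2:c"),
    Hyperedge("S", ("X1", "X2"), math.log(0.6), "monotone"),
    Hyperedge("S", ("X2", "X1"), math.log(0.3), "swap"),
]

best, back = viterbi(["X1", "X2", "S"], EDGES)
print(derivation("S", back), math.exp(best["S"]))
```

In this toy example the best derivation of S uses the higher-scoring leaf rule for X1 and the monotone combination. A practical decoder would add pruning and language model state to each node, which this sketch leaves out.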

Speaker Biographies

Bowen Zhou

Bowen Zhou is a Research Manager and Research Staff Member with the IBM Thomas J. Watson Research Center, where he heads the Dept. of Machine Translation and Learning and previously headed the Dept. of Speech-to-Speech Translation. He also serves as a Principal Investigator (PI) for the DARPA Transformative Apps program and was previously a PI for the TRANSTAC Program. In these roles, he has been responsible for advancing end-to-end speech-to-speech translation technologies spanning speech recognition, natural language processing, and machine translation. Most recently, he has been leading a research team that collaborates and competes on machine translation with top researchers around the world under the DARPA BOLT Program. In addition to his research agenda, Dr. Zhou develops and deploys speech and machine translation systems with his colleagues for various applications and platforms, ranging from smartphones to the cloud. Among other things, he and his team created MASTOR, a multilingual two-way real-time speech-to-speech translator that is hosted entirely on smartphones.

Dr. Zhou is a senior member of IEEE and a member of ACL, and he is currently an elected member of the IEEE Speech and Language Technical Committee. He served as Area Co-Chair for Machine Translation at NAACL/HLT 2012 and as an invited panelist at NAACL/HLT 2010. He has also served as a committee member and session chair for major speech and language conferences, professional meetings, and workshops. He has published over 60 papers in top conferences and journals. Dr. Zhou has a broad interest in statistical models, machine learning, and human language technologies, including both speech and text. Dr. Zhou received the B.E. degree from the University of Science and Technology of China in 1996 and the Ph.D. degree from the University of Colorado at Boulder in 2003.

Xiaodong He

Xiaodong He is a Researcher in the Speech Technology Group of Microsoft Research, Redmond, WA, USA. He is also an Affiliate Professor in the Department of Electrical Engineering at the University of Washington, Seattle, WA, USA. His research interests include speech recognition, spoken language understanding, machine translation, natural language processing, information retrieval, and machine learning. Dr. He has published more than 50 technical papers in these areas and co-authored the book Discriminative Learning for Speech Recognition: Theory and Practice. In benchmark evaluations, he and his colleagues have developed entries that took first place in the 2008 NIST Machine Translation Evaluation (NIST MT) and the 2011 International Workshop on Spoken Language Translation Evaluation (IWSLT), both in Chinese-English translation. He has held various editorial positions on leading professional journals, including Associate Editor of IEEE Signal Processing Magazine, Guest Editor of IEEE Transactions on Audio, Speech and Language Processing, and Lead Guest Editor of IEEE Journal of Selected Topics in Signal Processing. He served as Co-Chair of Special Sessions of ICASSP 2013 and General Co-Chair of the Workshop on Speech and Language at NIPS 2008. He has served on the program committees of major speech and language processing conferences, including ICASSP, INTERSPEECH, ACL, AAAI, EMNLP, NAACL, NIPS, and COLING. He is a senior member of IEEE and a member of ACL. Dr. He received the B.S. degree from Tsinghua University in 1996 and the Ph.D. degree from the University of Missouri-Columbia in 2003.