Tutorial 2: Auditory Transform and Features for Robust Speech and Speaker Recognition

Sunday, May 26, 2-5pm

Presented by

Qi (Peter) Li


This tutorial introduces the hearing system, the auditory transform, and auditory-based feature extraction to colleagues and students who have no prior experience with auditory modeling in noise reduction, speech recognition, or speaker authentication.

The human speech recognition system comprises the ear and the brain. Much effort has gone into developing large-vocabulary recognizers, whose role is analogous to that of the brain, while front-end signal processing and feature extraction play a role analogous to that of the ear. In recent years, feature extraction algorithms based on models of the auditory system have shown robust performance in speech and speaker recognition in noisy environments and under mismatched conditions.

In this talk, we will first introduce the human hearing system, including the functions of the outer ear, middle ear, inner ear, cochlea, and hair cells, from a signal processing point of view. Animation videos will show how these organs work together, so that their functions can be understood intuitively. We will then introduce a recently defined auditory transform (AT) and its inverse transform. The AT has been used in place of the Fourier transform (FT) in feature extraction algorithms, with many advantages. We will present the theory, visual spectrograms, and a comparative analysis of the AT and the FT.

Furthermore, we will review existing auditory-based features and present how to extract features from hearing models with the AT, the discriminative training of feature parameters, and the effect of each module of the auditory features on speech and speaker recognition performance. We will then present our experimental results and compare them with traditional MFCC and RASTA-PLP features in large-vocabulary speech recognition and speaker identification tasks. Lastly, we will explain why auditory features are more robust and point out further research directions.

The talk will have the following sections:

  • Introduction
  • Hearing system
  • Animation videos of the cochlea
  • Auditory transform
  • Models of outer ear, middle ear, inner ear, and hair cell
  • Auditory features
  • Discriminative training of auditory features
  • Experiments on speaker verification
  • Experiments on large vocabulary speech recognition
  • Future research directions
  • Conclusions

Speaker Biography

Qi (Peter) Li

Qi (Peter) Li received the Ph.D. degree in electrical engineering from the University of Rhode Island, Kingston, in 1995.

From 1995 to 2002, he worked at Bell Laboratories, AT&T and then Lucent Technologies, Murray Hill, NJ, as a Member of Technical Staff in the Multimedia Communications Research Lab, where his research focused on speech and speaker recognition, biometric authentication, front-end signal processing, and speech modeling. His research results were implemented in Lucent products and contributed to the Bell Labs ASR system, which achieved the top performance in a public robust speech recognition evaluation. He also achieved the best performance in a speaker verification evaluation conducted by one of the largest banks in the U.S. In 2002, he established Li Creative Technologies (LcT), Inc., Florham Park, NJ. LcT is a high-tech R&D company working in acoustic, speech, and image signal processing, multimedia applications, biometrics, and communication products. He is currently conducting research in hearing, acoustic signal processing, microphone arrays, speech and speaker recognition, noise reduction and cancellation, biometrics, and image processing for various applications and commercial products. Dr. Li holds many issued patents and has filed more than two dozen patent applications. He has published more than 80 papers in peer-reviewed journals and conferences. He is also the author of the book Speaker Authentication (Springer, 2011).

Dr. Li has been active as a reviewer for several journals, IEEE publications, and conferences. He is an elected member of the Speech and Language Technical Committee of the IEEE Signal Processing Society. He was a Local Chair for the IEEE Workshop on Automatic Identification and a committee member for several IEEE conferences and workshops. He has received a best paper award, an achievement award, and several Bell Labs patent awards. He has been listed in Who's Who in America (Millennium and 2001 Editions) and Who's Who in Executives and Professionals (2004 Edition). In 2004, he received the Success Award issued by an agency of the New Jersey Government. He and his team received the Best Consumer Technology/Electronics Company Award from the New Jersey Technology Council in 2006 and an Innovations Design and Engineering Award from the International CES in 2011. His awards, patents, and publications are listed on his personal home page, www.lilabs.com, where most of his publications can be downloaded.