Tutorial 15: Multi-modal user interfaces: a new music instruments perspective
Monday, May 27, 2-5 pm
George Tzanetakis, Sidney Fels
There is increasing interest in rich user interfaces that go beyond the traditional mouse/keyboard/screen interaction. In the past few years, their development has accelerated due to the wide availability of commodity sensors and actuators. For example, the Microsoft Kinect provides, at low cost, a structured light infrared depth camera, a regular color camera, and a microphone array. Moreover, smart phones contain a variety of additional sensors, such as accelerometers, that provide unique opportunities for control. Unlike traditional controllers such as a mouse, which provide direct and simple sensor readings, these new rich interfaces provide high-dimensional, noisy, and complex sensor readings. Therefore, sophisticated digital signal processing and machine learning techniques are required in order to develop effective human-computer interactions using such interfaces. At the same time, they offer fascinating possibilities for blending the physical and virtual worlds, such as non-invasive full-body control and augmented reality. Different algorithms and customizations are required for each specific application, and possibly for each user, so there is a large design space for novel research and contributions.
Traditional musical instruments are some of the most fascinating artifacts created by human beings in history. The complexity and richness of control afforded by an acoustic musical instrument, such as a cello, to a professional musician is impressive. Research in new interfaces for musical expression has explored how such complex and delicate control can be combined with the ability to interact with a computer. In some ways, research in this area has anticipated the development of augmented reality and rich multi-modal interfaces, and it provides good examples of rich sensory interactions. Based on these considerations, new musical instruments will be used throughout this tutorial to present working case studies and examples that illustrate particular concepts. At the same time, the concepts covered are general and can be applied to any type of rich multi-modal human-computer interaction. There is also a rich literature focusing on speech interaction (speech recognition and synthesis) that can also be multi-modal (typically audio-visual). In this tutorial, we will touch only briefly on speech-based interactions, and only for ideas and concepts that are relevant to other types of interaction, as speech is a more specialized topic that is covered much more thoroughly in existing literature and previous tutorials.
The target audience is graduate students and researchers with a basic background in digital signal processing who are interested in designing and developing novel human-computer interactions using rich multi-modal interfaces. It also targets researchers interested in multi-modal interfaces who come from a computer science background and might not be familiar with the possibilities offered by modern digital signal processing techniques. Many of the techniques used in the design and development of rich multi-modal interfaces, such as dynamic time warping (DTW) or Hidden Markov Models (HMM), are familiar to DSP practitioners. At the same time, real-time interactive interfaces pose specific challenges, such as real-time implementation, causality, and fault tolerance, that are not as critical in other areas of DSP. Therefore, part of the tutorial will show how DSP can be adapted for this application context to perform tasks such as automatic calibration, gesture detection, and tracking. The design and development of rich multi-modal interfaces is inherently interdisciplinary and also involves knowledge that is typically not part of the standard electrical and computer engineering curriculum. For example, the evaluation of such interfaces is not as straightforward as evaluation in other areas of DSP and requires an understanding of concepts from human-computer interaction. Because of the nature of this research, the literature tends to be scattered across many different communities, which makes it harder for newcomers to the field to follow. This tutorial tries to cover all the basic concepts needed to embark on this journey.
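For readers from outside DSP, the following is a minimal sketch of how dynamic time warping might be used to match a one-dimensional sensor trace against stored gesture templates. It is an illustration under stated assumptions, not the method presented in the tutorial: the function names `dtw_distance` and `classify`, the template names, and the sample values are all hypothetical.

```python
import math


def dtw_distance(a, b):
    """Accumulated DTW cost between two 1-D sequences.

    Uses absolute difference as the local cost and the classic
    three-neighbour step pattern. Runs in O(len(a) * len(b)).
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def classify(gesture, templates):
    """Return the name of the template with the smallest DTW cost."""
    return min(templates, key=lambda name: dtw_distance(gesture, templates[name]))


# Hypothetical templates: made-up 1-D sensor traces for two gestures.
templates = {
    "swipe": [0, 1, 2, 3],
    "shake": [0, 2, 0, 2],
}

# A slower performance of the "swipe" gesture (some samples repeated).
observed = [0, 0, 1, 2, 2, 3]
label = classify(observed, templates)
```

Note that this offline form needs the complete gesture before it can compute a cost, so it is not causal; adapting such techniques to streaming input is one example of the real-time challenges mentioned above.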
George Tzanetakis is an Associate Professor in the Department of Computer Science with cross-listed appointments in ECE and Music at the University of Victoria, Canada. He is Canada Research Chair (Tier II) in the Computer Analysis of Audio and Music and received the Craigdarroch Research Award in artistic expression at the University of Victoria in 2012. In 2011 he was Visiting Faculty at Google Research. He received his PhD in Computer Science at Princeton University in 2002 and was a Post-Doctoral Fellow at Carnegie Mellon University in 2002-2003. His research spans all stages of audio content analysis, such as feature extraction, segmentation, and classification, with specific emphasis on music information retrieval. He is also the primary designer and developer of Marsyas, an open source framework for audio processing with specific emphasis on music information retrieval applications. His pioneering work on musical genre classification received an IEEE Signal Processing Society Young Author Award and is frequently cited. He has given several tutorials at well-known international conferences such as ICASSP, ACM Multimedia, and ISMIR. More recently he has been exploring new interfaces for musical expression, music robotics, computational ethnomusicology, and computer-assisted music instrument tutoring. These interdisciplinary activities combine ideas from signal processing, perception, machine learning, sensors, actuators, and human-computer interaction with the connecting theme of making computers better understand music to create more effective interactions with musicians and listeners.
Sidney Fels has been in the Department of Electrical and Computer Engineering at the University of British Columbia since 1998. He received his Ph.D. and M.Sc. in Computer Science at the University of Toronto in 1994 and 1990 respectively. He received his B.A.Sc. in Electrical Engineering at the University of Waterloo in 1988. He was a visiting researcher at ATR Media Integration & Communications Research Laboratories in Kyoto, Japan from 1996 to 1997. He also worked at Virtual Technologies Inc. in Palo Alto, CA, developing the GesturePlus system and the CyberServer in 1995. His research interests are in human-computer interaction, neural networks, intelligent agents, and interactive arts. Some of his research projects include Glove-TalkII, Glove-Talk, Iamascope, InvenTcl, French Surfaces, Sound Sculpting, and the context-aware mobile assistant project (CMAP). Using the Glove-TalkII system, a person could speak with their hands. The device was built to be a virtual artificial vocal tract. The person using the system wore special gloves and used a foot pedal. These devices controlled a model of a vocal tract so that a person could "play" speech much as a musician plays music. His collaborative work on sound sculpting extends this idea to create musical instruments. The Iamascope is an interactive artwork that explores the relationship between people and machines. In Iamascope, the participant takes the place of the coloured piece of glass inside the kaleidoscope. The participant's movements cause a cascade of imagery and music to engulf them. His other artwork includes the Forklift Ballet, Video Cubism, PlesioPhone, and Waking Dream. He currently heads the Human Communication Technologies (HCT) Laboratory and is the Director of the Media and Graphics Interdisciplinary Centre (MAGIC) at the University of British Columbia.