Voice User Interface Design

Michael H. Cohen, James P. Giangola, Jennifer Balogh

This book is a comprehensive and authoritative guide to voice user interface (VUI) design. The VUI is perhaps the most critical factor in the success of any automated speech recognition (ASR) system, determining whether the user experience will be satisfying or frustrating, or even whether the customer will remain one. This book describes a practical methodology for creating an effective VUI design. The methodology is scientifically based on principles in linguistics, psychology, and language technology, and is illustrated here by examples drawn from the authors' work at Nuance Communications, the market leader in ASR development and deployment.

The book begins with an overview of VUI design issues and a description of the technology. The authors then introduce the major phases of their methodology. They first show how to specify requirements and make high-level design decisions during the definition phase. They next cover, in great detail, the design phase, with clear explanations and demonstrations of each design principle and its real-world applications. Finally, they examine problems unique to VUI design in system development, testing, and tuning. Key principles are illustrated with a running sample application. A companion Web site provides audio clips for each example: www.VUIDesign.org

The cover photograph depicts the first ASR system, Radio Rex: a toy dog who sits in his house until the sound of his name calls him out. Produced in 1911, Rex was among the few commercial successes in earlier days of speech recognition. Voice User Interface Design reveals the design principles and practices that produce commercial success in an era when effective ASRs are not toys but competitive necessities.

I'm using OpenEars for speech recognition in my app. My main concern is accuracy. In a quiet environment accuracy is about 50%, and in a noisy environment it gets worse: almost nothing is recognized correctly. I'm currently using a dictionary file of about 300 words. Which areas should I look at to improve accuracy? So far I haven't done any tweaking.

Designing a speech recognition application requires you to understand some basic concepts behind speech recognition, such as the acoustic model, the grammar, and the phonetic dictionary. You can learn more from the CMUSphinx tutorial: http://cmusphinx.sourceforge.net/wiki/tutorial

Poor accuracy is normal at the start of speech application development. There is a process you can follow to improve it and make the application useful:

  1. Collect samples of the speech you are trying to recognize and build a speech database from them, so that you can measure the current accuracy and understand the issues behind it.

  2. Experiment with the vocabulary size to improve the separation between different voice prompts. For example, a vocabulary of 10 commands is much easier to recognize than a vocabulary of 300 commands.

  3. Design your application so that there are fewer variants to recognize and people's answers are straightforward. This activity is called voice user interface (VUI) design, and it is a big area with many excellent books and blog articles. You can find some details here: http://www.amazon.com/Voice-Interface-Design-Michael-Cohen/dp/0321185765

  4. Try to improve the acoustic part of your application. Modify the dictionary to match the way your users actually pronounce the words. Adapt the acoustic model to match the acoustic properties of your audio. See http://cmusphinx.sourceforge.net/wiki/tutorialadapt for a description of the acoustic model adaptation process.
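
To make step 1 concrete, here is a minimal sketch of how you might score a test set once you have collected recordings and written down what was actually said. It computes the word error rate (WER), which is the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. This is not part of OpenEars or CMUSphinx; the function names and the sample transcripts are made up for illustration, and it assumes you have already run the recognizer and saved its output as text.

```python
from typing import Iterable, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # recognizer dropped a word (deletion)
                curr[j - 1] + 1,     # recognizer added a word (insertion)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER over (reference, hypothesis) pairs: total errors / total reference words."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors += edit_distance(ref, hyp)
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("the test set contains no reference words")
    return errors / ref_words


if __name__ == "__main__":
    # Hypothetical test set: what was said vs. what the recognizer returned.
    samples = [
        ("open the door", "open door"),
        ("call home", "call phone"),
        ("stop", "stop"),
    ]
    print(f"WER: {corpus_wer(samples):.1%}")
```

Running the same script on the same recordings after each change (a smaller vocabulary, a modified dictionary, an adapted acoustic model) tells you whether the change actually helped. Scoring quiet and noisy recordings as separate sets keeps the two problems apart.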