Statistical Methods for Speech Recognition

Frederick Jelinek

Mentioned 2

This book reflects decades of important research on the mathematical foundations of speech recognition. It focuses on underlying statistical techniques such as hidden Markov models, decision trees, the expectation-maximization algorithm, information theoretic goodness criteria, maximum entropy probability estimation, parameter and data clustering, and smoothing of probability distributions. The author's goal is to present these principles clearly in the simplest setting, to show the advantages of self-organization from real data, and to enable the reader to apply the techniques.


Mentioned in questions and answers.

I want to learn about the various techniques for speech recognition and text-to-speech conversion. Please also point me to any resources on the topic, such as links, tutorials, or ebooks.

Which technique is the most efficient?

I'm going to answer the part about speech recognition (since I don't know much about text-to-speech):

This book, "Statistical Methods for Speech Recognition", is a classic that explains the mathematical foundations of statistical speech recognition. It was written by Frederick Jelinek, the founder of that area.

The most important concept to know is the Hidden Markov Model (HMM). People have been using HMMs in speech recognition for decades. A more recent approach uses Conditional Random Fields; see the paper (PDF) and the associated software toolkit SCARF.
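To make the HMM idea a little more concrete, here is a minimal sketch of the forward algorithm, which computes how likely an observation sequence is under a discrete HMM. The function name `forward_likelihood` and every probability in the toy model are made up for illustration; a real recognizer would use acoustic feature vectors and trained parameters.

```python
def forward_likelihood(obs, start, trans, emit):
    """Return P(obs | model) for a discrete HMM via the forward algorithm.

    obs   : list of observation symbol indices
    start : start[i] is the probability of starting in state i
    trans : trans[i][j] is the probability of moving from state i to j
    emit  : emit[i][k] is the probability of state i emitting symbol k
    """
    n_states = len(start)
    # alpha[i] = P(observations so far, current state = i)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][symbol]
            for j in range(n_states)
        ]
    # Summing over the final state gives the total sequence likelihood.
    return sum(alpha)


# Toy two-state, two-symbol model (hypothetical numbers).
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]

likelihood = forward_likelihood([0, 1], start, trans, emit)
```

With one such model per word, a recognizer can score an utterance against each model and pick the word whose model gives the highest likelihood. Longer sequences would normally need log probabilities or scaling to avoid numerical underflow, which this sketch omits.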

It is fairly hard to write your own speech recognizer. It's an active research area with several scientific conferences, e.g. ASRU, Interspeech, ICASSP.

I'm developing a project that identifies phonemes in order to determine whether someone is saying "Yes" or "No".

So far I have used zero-crossings to identify what the person is saying. This works really well and is simple enough to understand. However, the project needs a few enhancements and has to be developed using a Hidden Markov Model.

My question is this:

I want to develop a Hidden Markov Model without discarding the work that I have already completed. That is, I currently strip out the data that does not warrant consideration by counting the number of zero-crossings and summing the blocks.

I do not understand what data I would need to train the HMM so that it can identify these phonemes. For example:

With zero-crossings I have identified that:

Yes - Zero-crossings start low and then the value increases.

No - Zero-crossings start low and then do not increase in value.

Could I train my HMM algorithm so that it interprets these values?

Or could anyone suggest a method by which I can train the HMM to identify the word that is input in the sample?

Hope someone can help :)!

Automated phoneme segmentation is a tough problem, so I'll provide some of my favored resources that touch on the topic in various levels of detail.

This paper: http://www.seas.upenn.edu/~jan/Files/Iscas99Speech.pdf

This paper: http://www.ll.mit.edu/publications/journal/pdf/vol08_no2/8.2.1.languageidentification.pdf

This resource is very good: http://research.microsoft.com/pubs/118769/Book-Chap-HuangDeng2010.pdf

This book gives some good examples for phoneme identification: http://www.amazon.com/Speech-Recognition-Theory-C-Implementation/dp/0471977306/

This book is pretty good, too: http://www.amazon.com/Statistical-Methods-Recognition-Language-Communication/dp/0262100665/

The books are expensive, but they are worth it (in my opinion).
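As a rough illustration of how the zero-crossing measurements described in the question could be turned into an observation sequence, the sketch below counts zero-crossings per frame and quantizes each count into a "low" or "high" symbol. The function names, the frame size, and the threshold are all hypothetical choices made for this example, not values taken from the question or from any of the resources above.

```python
def zero_crossings_per_frame(samples, frame_size):
    """Count sign changes inside each non-overlapping frame of samples.

    Crossings that fall exactly on a frame boundary are not counted,
    and a trailing partial frame is ignored.
    """
    counts = []
    for begin in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[begin:begin + frame_size]
        count = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        counts.append(count)
    return counts


def quantize(counts, threshold):
    """Map each count to symbol 1 (high) or 0 (low) using a fixed threshold."""
    return [1 if c >= threshold else 0 for c in counts]


# Tiny synthetic signal: a flat first frame, then a rapidly alternating one.
samples = [1, 1, 1, 1, 1, -1, 1, -1]
counts = zero_crossings_per_frame(samples, frame_size=4)
symbols = quantize(counts, threshold=2)
```

Here the symbol sequence goes from low to high, which matches the "Yes" pattern described in the question. Sequences like this, collected from many labeled recordings, are the kind of data a discrete HMM per word could be trained on, although practical systems typically use richer features than zero-crossings alone.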