Spoken Language Processing

Xuedong Huang, Alejandro Acero, Hsiao-Wuen Hon

Preface

Our primary motivation in writing this book is to share our working experience to bridge the gap between the knowledge of industry gurus and newcomers to the spoken language processing community. Many powerful techniques hide in conference proceedings and academic papers for years before becoming widely recognized by the research community or the industry. We spent many years pursuing spoken language technology research at Carnegie Mellon University before we started spoken language R&D at Microsoft. We fully understand that it is by no means a small undertaking to transfer a state-of-the-art spoken language research system into a commercially viable product that can truly help people improve their productivity. Our experience in both industry and academia is reflected in the content of this book, which presents a contemporary and comprehensive description of both theoretic and practical issues in spoken language processing.

This book is intended for people of diverse academic and practical backgrounds. Speech scientists, computer scientists, linguists, engineers, physicists, and psychologists all have a unique perspective on spoken language processing. This book will be useful to all of these special interest groups. Spoken language processing is a diverse subject that relies on knowledge of many levels, including acoustics, phonology, phonetics, linguistics, semantics, pragmatics, and discourse. The diverse nature of spoken language processing requires knowledge in computer science, electrical engineering, mathematics, syntax, and psychology.

There are a number of excellent books on the subfields of spoken language processing, including speech recognition, text-to-speech conversion, and spoken language understanding, but there is no single book that covers both theoretical and practical aspects of these subfields and spoken language interface design. We devote many chapters to systematically introducing fundamental theories needed to understand how speech recognition, text-to-speech synthesis, and spoken language understanding work. Even more important is the fact that the book highlights what works well in practice, which is invaluable if you want to build a practical speech recognizer, a practical text-to-speech synthesizer, or a practical spoken language system. Using numerous real examples in developing Microsoft's spoken language systems, we concentrate on showing how the fundamental theories can be applied to solve real problems in spoken language processing.

Mentioned in questions and answers.

I just read this post here (the first answer):

https://www.quora.com/What-is-a-hidden-Markov-Model-HMM-and-how-can-it-be-used-in-speech-recognition

I kinda get it, but not fully. Without getting too technical: how exactly does word recognition work?

In the post, the HMM should recognize the word "cat", represented by the phonemes /k/ /a/ /t/.

So let's say the HMM is in the state for /k/. That means it successfully recognized the /k/, right?

How exactly does it then recognize the /a/? There is a certain probability that after /k/ the next state is /a/, but also, e.g., that the next state is /e/, right?

Do these probabilities come from training the model on the corpus? So, if most words in the corpus contain /ka/ instead of /ke/, the probability of going from state /k/ to /a/ is higher than of going from /k/ to /e/?

How is it determined that it will next go to the state for /a/ and not to /e/?

And it says that the phonemes are the hidden parts... Does that mean we can't see which phoneme the model chose, we can just see that it is now in the state for /k/? And we can only see which phonemes it chose after it has processed the entire word, and the outcome is either correct or not?

And that would mean that it can only ever go from /k/ to /a/, but you cannot guarantee that it finds the correct phonemes...?

I'm not trying to understand the ins and outs of this model, just generally how it works for speech recognition.

It is better to read more reputable sources than random Quora answers to get a full idea of complex algorithms. For example, Rabiner's HMM tutorial is a good start. You can also check a textbook like Spoken Language Processing, which gives a good description of the subject.

The sequence of observable events in speech recognition is the sequence of audio frames. Each frame is roughly 20 ms of sound. The sequence of unobservable (hidden) events is roughly the sequence of phonemes. Actually it is more complex, but you can think of them as phonemes. Besides the HMM itself, which is just a mathematical object, there is an important part: the decoding algorithm, called Viterbi search, which finds the best match between observable and hidden states according to the probabilities. This algorithm efficiently evaluates all possible breakdowns and finds the best one. That best one is the decoding result.
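
To make the "observable events" concrete, here is a minimal sketch of how a signal gets chopped into frames. The 20 ms frame length and 10 ms shift are just typical values assumed for illustration; a real front end would also convert each frame into a feature vector before scoring it against the HMM states.

    import numpy as np

    def split_into_frames(signal, sample_rate=16000, frame_ms=20, shift_ms=10):
        """Chop a 1-D audio signal into overlapping ~20 ms frames.

        Each frame becomes one observable event for the recognizer; a real
        system would turn every frame into a feature vector before scoring
        it against the HMM states.
        """
        frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
        shift = int(sample_rate * shift_ms / 1000)      # hop between frames
        frames = [signal[start:start + frame_len]
                  for start in range(0, len(signal) - frame_len + 1, shift)]
        return np.array(frames)

    # One second of (fake) audio at 16 kHz -> 99 overlapping 20 ms frames.
    audio = np.random.randn(16000)
    print(split_into_frames(audio).shape)  # (99, 320)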

So let's say the HMM is in the state for /k/. That means it successfully recognized the /k/, right?

There is no such thing as "the HMM is in state /k/". We consider frame 1 and say it corresponds to /k/, then we consider frame 2 and decide whether it corresponds to /k/, to /a/ or to /e/. For that we use the previous state for the previous frame and also the acoustic match between frame 2 and all three states. This acoustic match is usually estimated with a separate model, for example a Gaussian mixture model (GMM); do not confuse it with the hidden Markov model. Both models are estimated from the corpus. After we store some possible decisions for frame 2, we move to frame 3 and decide whether it belongs to any of the expected hidden states. Notice that we do not keep one best decision but multiple possible decisions along the way, because the locally best decision (frame 2 corresponds to /a/) might not be globally best (frame 2 corresponds to /e/). At the end of decoding we have a full alignment between hidden and observable states, and we can estimate the probability of this alignment using the HMM.
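
The bookkeeping described above is essentially the Viterbi algorithm. Below is a minimal sketch for a toy model over the three states /k/, /a/, /e/; all numbers are invented for illustration, and the per-frame acoustic scores stand in for what a separate acoustic model (such as a GMM) would provide.

    import numpy as np

    states = ["k", "a", "e"]

    # Toy transition probabilities P(next state | current state); invented numbers.
    trans = np.array([
        [0.5, 0.3, 0.2],   # from /k/
        [0.1, 0.8, 0.1],   # from /a/
        [0.1, 0.1, 0.8],   # from /e/
    ])

    # Toy acoustic scores P(frame | state) for 4 frames; in a real recognizer
    # these come from a separate acoustic model (e.g. a GMM), not from the HMM.
    acoustic = np.array([
        [0.7, 0.2, 0.1],   # frame 1 sounds most like /k/
        [0.3, 0.4, 0.3],   # frame 2 is ambiguous between /a/ and /e/
        [0.1, 0.6, 0.3],   # frame 3 sounds like /a/
        [0.1, 0.7, 0.2],   # frame 4 sounds like /a/
    ])

    def viterbi(trans, acoustic, start=np.array([0.8, 0.1, 0.1])):
        n_frames, n_states = acoustic.shape
        # best[t, s] = probability of the best breakdown ending in state s at frame t
        best = np.zeros((n_frames, n_states))
        backptr = np.zeros((n_frames, n_states), dtype=int)
        best[0] = start * acoustic[0]
        for t in range(1, n_frames):
            for s in range(n_states):
                # Keep a score for *every* state, not only the locally best one.
                scores = best[t - 1] * trans[:, s] * acoustic[t, s]
                backptr[t, s] = np.argmax(scores)
                best[t, s] = scores[backptr[t, s]]
        # Only after the last frame do we trace back the globally best path.
        path = [int(np.argmax(best[-1]))]
        for t in range(n_frames - 1, 0, -1):
            path.append(int(backptr[t, path[-1]]))
        return [states[s] for s in reversed(path)]

    print(viterbi(trans, acoustic))  # ['k', 'a', 'a', 'a']

Real decoders work with log probabilities and prune unlikely hypotheses, but the principle is the same: keep a score per state for every frame and decide only at the end.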

How exactly does it then recognize the /a/? There is a certain probability that after /k/ the next state is /a/, but also, e.g., that the next state is /e/, right?

It combines the probabilities of the existing breakdowns with the GMM score for this frame to update the probabilities of the breakdowns that include the new frame. The GMM score tells how well the audio matches the expected sound of /a/, and the GMM is trained from the database.
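
To make the "GMM score" concrete, here is a hedged sketch of a diagonal-covariance Gaussian mixture model scoring one frame's feature vector. The weights, means, and variances here are made up; in a real system they are trained from the speech database.

    import numpy as np

    def gmm_log_likelihood(frame, weights, means, variances):
        """Log P(frame | phoneme state) under a diagonal-covariance GMM.

        frame:     feature vector for one audio frame, shape (d,)
        weights:   mixture weights, shape (m,)
        means:     component means, shape (m, d)
        variances: per-dimension variances, shape (m, d)
        """
        d = frame.shape[0]
        # Per-component log Gaussian density (diagonal covariance).
        log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
        log_expo = -0.5 * np.sum((frame - means) ** 2 / variances, axis=1)
        comp_log = np.log(weights) + log_norm + log_expo
        # Log-sum-exp over the mixture components.
        top = comp_log.max()
        return top + np.log(np.sum(np.exp(comp_log - top)))

    # Toy 2-component GMM over 3-dimensional features for the state /a/.
    weights = np.array([0.6, 0.4])
    means = np.array([[0.0, 1.0, -1.0], [0.5, 0.5, 0.5]])
    variances = np.ones((2, 3))
    frame = np.array([0.1, 0.9, -0.8])
    print(gmm_log_likelihood(frame, weights, means, variances))

During decoding this score is combined (in the log domain, simply added) with the transition probability to update the score of each breakdown that includes the new frame.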

And it says that the phonemes are the hidden parts... Does that mean we can't see which phoneme the model chose, we can just see that it is now in the state for /k/? And we can only see which phonemes it chose after it has processed the entire word, and the outcome is either correct or not?

We can only see phonemes after processing the entire word.

And that would mean that it can only ever go from /k/ to /a/, but you cannot guarantee that it finds the correct phonemes...?

Locally you cannot guarantee that; you need to compare the global picture, or at least continue decoding for some frames after the current phoneme. That's why you have to keep multiple decoding results during the search, not just the single best one.

I'm trying to figure out what exactly context-independent/dependent acoustic modeling is. I've been trying to read through some of the papers that address it, but I'm still a little shaky with the concept. As I currently understand it (which could be wrong), context-dependent acoustic models are acoustic models trained on data where the phonemes occur in sequences. For example, trained on a target language with words, so the phonemes are made context dependent by the phonemes that occur before and after, giving them context. And a context-independent model would be an acoustic model somehow trained with just the phonemes in isolation.

The conventional approach is to recognize speech with a hidden Markov model (HMM). Basically, in an HMM you try to represent the input sound as a sequence of states. Each state corresponds to a certain part of a phoneme.

The difference is not in what the model is trained on, but in the structure of the model itself. An acoustic model is a set of detectors of sounds. Each detector describes what a sound is like; for example, it might be a Gaussian Mixture Model (GMM) which describes the most probable values of the phoneme's features, or it could be a neural network which detects a specific sound.

In a context-independent model the structure of the hidden Markov model is simple: you detect all occurrences of a phone with a single detector. Say you detect the word "hi" with the detectors for

 HH_begin HH_middle HH_end IY_begin IY_middle IY_end

And you detect the word "hoy" with exactly the same detectors for the phone HH

 HH_begin HH_middle HH_end OY_begin OY_middle OY_end

In a context-dependent model the detectors for HH in "hi" and "hoy" are different and trained separately; basically, they have their own separate sets of parameters. This is reasonable because the surrounding phones do affect the pronunciation of the phone itself; the phone starts to sound a bit different. So you have

 HH_before_IY_begin HH_before_IY_middle 
     HH_before_IY_end IY_after_HH_begin 
        IY_after_HH_middle IY_after_HH_end

And for "hoy"

 HH_before_OY_begin HH_before_OY_middle 
     HH_before_OY_end OY_after_HH_begin 
        OY_after_HH_middle OY_after_HH_end

The advantage of this approach is that, because you have more parameters, you can recognize speech more accurately. The disadvantage is that you have to consider many, many more variants instead.
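
One way to see the difference is simply to list the detector names each approach would use. Here is a small sketch that mimics the HH/IY/OY examples above; the naming is only illustrative, and real systems usually condition on both neighbouring phones (triphones).

    def context_independent_states(phones):
        """One shared set of detectors per phone, whatever the neighbours are."""
        return [f"{p}_{part}" for p in phones
                for part in ("begin", "middle", "end")]

    def context_dependent_states(phones):
        """Detectors also depend on a neighbouring phone, as in HH_before_IY above."""
        states = []
        for i, p in enumerate(phones):
            if i + 1 < len(phones):
                name = f"{p}_before_{phones[i + 1]}"  # condition on the next phone
            elif i > 0:
                name = f"{p}_after_{phones[i - 1]}"   # condition on the previous phone
            else:
                name = p                              # lone phone: no context available
            states.extend(f"{name}_{part}" for part in ("begin", "middle", "end"))
        return states

    print(context_independent_states(["HH", "IY"]))  # "hi": HH detectors shared with "hoy"
    print(context_independent_states(["HH", "OY"]))  # "hoy": exactly the same HH detectors
    print(context_dependent_states(["HH", "IY"]))    # "hi": HH detectors specific to IY context
    print(context_dependent_states(["HH", "OY"]))    # "hoy": a different set of HH detectors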

Speech recognition algorithms are quite a bit more complex than what the public web usually describes. For example, to reduce the number of detectors, context-dependent models are usually clustered and tied into a smaller set. Instead of hundreds of thousands of possible context-dependent detectors you have just a couple of thousand detectors, merged to provide good discrimination and generalization.
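
As a toy illustration of tying, the sketch below maps several context-dependent detector names onto a shared physical detector. The grouping and the extra contexts (IH, OW) are hand-written and hypothetical here; real systems learn the mapping from data, for example with decision-tree clustering.

    # Hypothetical tying table: several context-dependent states share one
    # physical detector. In practice this mapping is learned from data,
    # not written by hand, and covers the whole state inventory.
    tied = {
        "HH_before_IY_begin": "HH_begin_tied_1",
        "HH_before_IH_begin": "HH_begin_tied_1",  # similar contexts share a detector
        "HH_before_OY_begin": "HH_begin_tied_2",
        "HH_before_OW_begin": "HH_begin_tied_2",
    }

    def detector_for(state):
        """Return the shared detector actually used to score a logical state."""
        return tied.get(state, state)

    print(detector_for("HH_before_IY_begin"))  # HH_begin_tied_1
    print(detector_for("HH_before_IH_begin"))  # HH_begin_tied_1 (same detector)
    print(len(set(tied.values())), "physical detectors for", len(tied), "logical states")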

If you are serious about speech recognition algorithms and practices, then instead of relying on random sources on the web it is better to read a textbook like Spoken Language Processing, or at least the paper The Application of Hidden Markov Models in Speech Recognition.