Fundamentals of Speech Recognition

Lawrence R. Rabiner, Biing-Hwang Juang

Mentioned 3

Provides a theoretically sound, technically accurate, and complete description of the basic knowledge and ideas that constitute a modern system for speech recognition by machine. Covers production, perception, and acoustic-phonetic characterization of the speech signal; signal processing and analysis methods for speech recognition; pattern comparison techniques; speech recognition system design and implementation; theory and implementation of hidden Markov models; speech recognition based on connected word models; large vocabulary continuous speech recognition; and task-oriented application of automatic speech recognition. For practicing engineers, scientists, linguists, and programmers interested in speech recognition.

Mentioned in questions and answers.

As described in several books, the process of recognition of isolated words consists of the following:

  1. For a given set of signals (templates), determine the features of each template as an M×N matrix, where M is the number of features (MFCC, ZCR, …) and N is the number of frames.
  2. Train a model on the templates with some algorithm, such as ANN, HMM, GMM, or SVM.
  3. Recognize the test signal with the trained model.

Because speech signals have different durations, their lengths are aligned with the Dynamic Time Warping (DTW) technique, so that N is the same for all templates. This can be done during training.

My question is: how do I change the length of the test signal? I cannot use DTW on it, since I do not know which class it belongs to. Should I use pitch-preserving "time stretching" algorithms, and if so, how will this affect recognition accuracy?

You do not need to change the length to make a match. Extract features from the reference samples and from the test sample; they will all have different numbers of frames. Then run DTW between each reference and the test sample to align them. Each DTW run gives you a match score between the test sample and one reference. In effect, you non-uniformly stretch each reference to match the test sample. Because every reference is compared with the same test sample, the DTW scores are comparable with each other, so you select the reference with the best score as the result.
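The matching procedure described above can be sketched roughly as follows. This is a minimal illustration, not a production recognizer: the function names `dtw_distance` and `recognize`, the toy one-dimensional "features", and the use of Euclidean distance between frames are all assumptions made for the example.

```python
import math

def dtw_distance(a, b):
    """Accumulated DTW cost between two sequences of feature vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # local frame-to-frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # repeat frame of b
                                 cost[i][j - 1],      # repeat frame of a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def recognize(test, references):
    """Compare the test sequence with every reference; lowest cost wins."""
    scores = {label: dtw_distance(ref, test) for label, ref in references.items()}
    return min(scores, key=scores.get), scores

# Toy data: one feature per frame, references of different lengths.
references = {
    "yes": [[0.0], [1.0], [2.0], [1.0], [0.0]],
    "no": [[2.0], [2.0], [0.0], [0.0]],
}
test = [[0.0], [0.0], [1.0], [2.0], [2.0], [1.0], [0.0]]  # a stretched "yes"

label, scores = recognize(test, references)
print(label, scores)
```

Here the test sequence is never resampled; DTW absorbs the difference in length. With references of very different lengths, it is common to normalize the accumulated cost (for example by the path length) before comparing scores; that step is left out of this sketch.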

For details on DTW-based speech recognition, check this presentation.

If you want to go deeper into speech recognition with DTW, you can read the book Fundamentals of Speech Recognition, 1st Edition, by Lawrence Rabiner and Biing-Hwang Juang.

I am implementing software for speech recognition using Mel Frequency Cepstrum Coefficients (MFCCs). In particular, the system must recognize a single specified word. From the audio file I get the MFCCs as a matrix with 12 rows (the MFCCs) and as many columns as there are voice frames. I average each row, so I get a vector with only 12 entries (the i-th entry is the average of the i-th MFCC over all frames). My question is how to train a classifier to detect the word? I have a training set with only positive samples: the MFCCs that I get from several audio files (several recordings of the same word).

I average each row, so I get a vector with only 12 entries (the i-th entry is the average of the i-th MFCC over all frames).

This is a very bad idea because averaging discards the order of the frames, and with it most of the information that identifies the word. You need to analyze the whole MFCC sequence, not a summary of it.
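A small illustration of the problem, using made-up two-dimensional frames rather than real MFCCs (the helper name `mean_vector` is hypothetical): two sequences that contain the same frames in opposite order are clearly different signals, yet their averages are identical.

```python
def mean_vector(frames):
    """Average each feature over all frames (what the question does)."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[k] for f in frames) / n for k in range(dim)]

# Two toy "words": the same frames played forwards and backwards.
word_a = [[1.0, 0.0], [2.0, 1.0], [3.0, 5.0]]
word_b = list(reversed(word_a))

print(word_a != word_b)                            # the sequences differ
print(mean_vector(word_a) == mean_vector(word_b))  # but the averages do not
```

Any classifier trained on the averaged vectors cannot tell these two sequences apart, which is why sequence-aware methods are suggested instead.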

My question is how to train a classifier to detect the word?

The simplest form would be a GMM classifier; you can check here:

For a more complex form, you need to learn a more complex model, such as an HMM. You can learn more about HMMs from a textbook like this one.
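As a rough sketch of the simple end of this range, the code below fits a single diagonal-covariance Gaussian to all frames of the positive recordings and accepts a test recording if its average per-frame log-likelihood is above a threshold. This is only the one-component special case of a GMM; a real GMM would use several components fitted with EM. The function names, the toy two-dimensional frames, and the threshold margin of 1.0 are assumptions made for illustration.

```python
import math

def fit_diag_gaussian(frames):
    """Per-dimension mean and variance over all training frames."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[k] for f in frames) / n for k in range(dim)]
    # Floor the variance so a constant feature cannot cause division by zero.
    var = [max(sum((f[k] - mean[k]) ** 2 for f in frames) / n, 1e-6)
           for k in range(dim)]
    return mean, var

def avg_log_likelihood(frames, model):
    """Average log-likelihood per frame under the diagonal Gaussian."""
    mean, var = model
    total = 0.0
    for f in frames:
        for x, m, v in zip(f, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)

# Toy positive-only training set: recordings with different frame counts.
train_recordings = [
    [[1.0, 2.0], [1.2, 2.1], [0.9, 1.8]],
    [[1.1, 1.9], [0.8, 2.2], [1.0, 2.0], [1.1, 2.1]],
]
pooled = [frame for rec in train_recordings for frame in rec]
model = fit_diag_gaussian(pooled)

# Assumed decision rule: lowest training score minus an arbitrary margin.
threshold = min(avg_log_likelihood(r, model) for r in train_recordings) - 1.0

def is_target_word(frames):
    return avg_log_likelihood(frames, model) >= threshold

print(is_target_word([[1.0, 2.1], [1.1, 1.9]]))    # frames like the training data
print(is_target_word([[5.0, -3.0], [4.5, -2.5]]))  # frames far from the training data
```

Note that this model scores every frame independently, so it still ignores frame order; an HMM adds that temporal structure on top of per-state Gaussian models.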

I have audio recordings of 4 phonemes (a, e, o, u) from 11 people. I trained an ANN using the data from 10 people and used the remaining person's data for testing. I used 14 LPC coefficients of the first period (20ms) of records as features.

The training matrix I has 14 rows and 10 columns for each phoneme. So it is 14*40. Since it is a supervised classification problem, I constructed a target matrix T which is 4*40. It contains ones and zeros where a 1 indicates that the corresponding column in I is from that class.

The test data matrix contains four columns and 14 rows as it contains 4 phonemes from only one person. Let us call it S.

Here is the code:

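% I: 14x40 training inputs, T: 4x40 one-hot targets, S: 14x4 test inputs
net = newff(I, T, 15);           % feed-forward network with one hidden layer of 15 neurons
net = init(net);                 % reinitialize weights and biases
net.trainParam.epochs = 10000;   % maximum number of training epochs
net.trainParam.goal = 0.01;      % stop once the performance goal is reached
net = train(net, I, T);
y1 = sim(net, I);                % outputs on the training data
y2 = sim(net, S)                 % outputs on the held-out speaker (no semicolon, so it is printed)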

The results are not good, even when I give the training data as test data (y1).

What is wrong here?

I used 14 LPC coefficients of the first period (20ms) of records as features.

So you ignored almost all of the sound data except the first 20 ms? That doesn't sound right. You should at least compute an average over all frames.

What is wrong here?

You started coding without understanding the theory. You probably want to read an introduction first: at least this, and ideally this.

To understand why the ANN doesn't work, calculate how many parameters are required to map 14 features to 4 classes, then calculate how many training vectors you have for every parameter. As a rule of thumb, you need at least 10 samples per parameter for an initial estimate. That means your training data is not enough.
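The arithmetic can be worked through for the network in the question, assuming a fully connected network with 14 inputs, one hidden layer of 15 neurons, 4 outputs, and one bias per neuron (the helper name `mlp_param_count` is hypothetical):

```python
def mlp_param_count(n_inputs, n_hidden, n_outputs):
    """Weights plus biases of a one-hidden-layer fully connected network."""
    hidden_layer = n_inputs * n_hidden + n_hidden    # 14*15 weights + 15 biases
    output_layer = n_hidden * n_outputs + n_outputs  # 15*4 weights + 4 biases
    return hidden_layer + output_layer

n_params = mlp_param_count(14, 15, 4)
n_train = 40                     # 4 phonemes x 10 speakers
samples_wanted = 10 * n_params   # the "10 samples per parameter" rule of thumb

print(n_params)                  # 289
print(n_train / n_params)        # well below 1 training vector per parameter
print(samples_wanted)            # 2890
```

Under these assumptions the network has 289 free parameters but only 40 training vectors, far short of the roughly 2890 that the rule of thumb asks for.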