
A SPEECH RECOGNITION SYSTEM FOR OPERATION OF A VISUAL INSPECTION WORKSTATION

Hasan Abu-Zaina, David M. Kirk and Elmer A. Hoyer, Institute for Rehabilitation Research and Service, Wichita State University, Wichita, KS, U.S.A.

ABSTRACT

A speech recognition algorithm utilizing the Hidden Markov Model has been implemented for voice control of a visual inspection workstation. The system was implemented as a speaker independent, isolated word, limited vocabulary system and trained on a 50 word vocabulary consisting of command words for the workstation. It was tested using 18 speakers, nine male and nine female, and found to have a recognition accuracy of 94.5% across all words and all speakers. The model was extended to recognize the speech of a person with a mild verbal impairment due to cerebral palsy, achieving a recognition accuracy of 94% across all words. This system will be tested further with other individuals with verbal impairments and made to operate under Microsoft Windows to control a visual inspection workstation.

BACKGROUND

The use of the Hidden Markov Model (HMM) for speech recognition was first proposed by Baker [1]. It has since become the predominant approach to speech recognition, superseding the Dynamic Time Warping (DTW) technique. The overall recognition accuracy and performance of the HMM have been found to be comparable to those obtained with a DTW algorithm, with substantially lower computation and storage requirements. With the HMM technique, the speech signal is modeled as being generated by two interrelated mechanisms: a Markov chain having a finite number of states, and a set of random functions, one associated with each state. At discrete instants of time, the process is assumed to be in some state, and an observation is generated by the random function corresponding to the current state. Each state is capable of generating a finite number of possible outputs. The Markov chain then changes state according to its transition probability matrix.

It is quite natural to think of the speech signal as being generated by such a process. We can imagine the vocal tract as being in one of a finite number of articulatory configurations, or states. In each state a short signal is produced that has one of a finite number of prototypical spectra depending, of course, on the state. Thus, the power spectra of short intervals of the speech signal are determined solely by the current state of the model, while the variation of the spectral composition of the signal with time is governed predominantly by the probabilistic state transition law of the Markov chain [2].
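To make the generative picture concrete, the following is a minimal sketch of how a discrete HMM with transition matrix A and per-state output distributions B emits an observation sequence. The 3-state, 4-symbol matrices are illustrative placeholders, not the models used in this work.

```python
import numpy as np

# Toy left-to-right HMM: 3 states, 4 output symbols (illustrative values only).
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])      # state transition probabilities
B = np.array([[0.7, 0.2, 0.1, 0.0],
              [0.1, 0.6, 0.2, 0.1],
              [0.0, 0.1, 0.3, 0.6]])  # P(output symbol | state)

rng = np.random.default_rng(0)

def generate(A, B, T, start_state=0):
    """Emit T observation symbols, starting in `start_state`."""
    state, obs = start_state, []
    for _ in range(T):
        obs.append(rng.choice(B.shape[1], p=B[state]))  # emit from current state
        state = rng.choice(A.shape[0], p=A[state])      # then change state
    return obs

print(generate(A, B, 10))
```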

RESEARCH QUESTIONS

This research sought to answer the following questions:

1. Can a user independent, isolated utterance, small vocabulary speech recognition system be developed to assist an individual with a hand motor impairment in operating a visual inspection workstation?

2. Can this system be extended to assist an individual with a verbal impairment?

METHOD

In training the HMM system, nine normative speakers were used, with each speaker saying each word, in isolation, five times. Care was taken during recording to minimize the sing-song tendency that arises when a speaker says the same word five times in a row. The speech wave files were edited to remove the silence at the beginning of each word. All speech files were sampled at 22050 Hz with 16 bits per sample, recorded using the mono input of a Pro-Audio Spectrum-16 sound card with the subject isolated in a soundproof booth. All recorded speech files were then downsampled to 11025 Hz.

For calculating the Linear Predictive Coding (LPC) parameters and cepstral coefficients, each speech waveform was partitioned into frames of 30 ms with an overlap of 15 ms between adjacent frames, and a Hamming window was applied to each frame. The order of the LPC parameters was 10 and the dimension of the cepstral coefficients was 12. The block diagram shown in Figure 1 illustrates the process of obtaining the cepstral coefficients from the speech signal. The digitized speech signal is first processed by a first-order digital preemphasis filter in order to spectrally flatten the signal. Sections of consecutive speech samples corresponding to 30 ms of the speech signal form a single frame; consecutive frames are spaced 15 ms apart, giving a 15 ms frame overlap. Each frame is multiplied by a Hamming window w(n) so as to minimize the adverse effect of extracting a 30 ms section out of the running speech signal. Each windowed set of speech samples is autocorrelated to give a set of eleven coefficients, and for each frame a vector of ten LPC coefficients is computed from the autocorrelation vector. An LPC-derived cepstral vector is then computed up to the 12th component for each frame, and the 12-coefficient cepstral vectors are weighted by a cepstral window. A sketch of this feature extraction pipeline appears below.

From the cepstral coefficients of all words in the vocabulary, a 256-entry codebook was created. Vector quantization (VQ) is used to map each observation vector into a discrete codebook index using a simple nearest-neighbor computation. A distinct HMM (i.e., the A and B matrices) was designed for each of the 50 words in the vocabulary. A left-to-right model with five states was used for each word; the model is assumed to always start in state 1 and end in state 5. Starting from initial estimates of the transition probability matrix A and the output probability matrix B, the iterative Baum-Welch algorithm [3] was used to obtain good final values of A and B for each word.

In the recognition phase, the speech signal is converted into a set of cepstral vectors, from which an observation sequence is obtained using the codebook. The observation sequence is then matched successively against each word HMM of the vocabulary using a Viterbi scoring algorithm [4]. Figure 2 illustrates the HMM recognition system.
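The following is a minimal sketch of the feature extraction pipeline described above (preemphasis, 30 ms Hamming-windowed frames with 15 ms overlap, 11 autocorrelation coefficients, order-10 LPC via Levinson-Durbin, 12 LPC-derived cepstral coefficients, cepstral weighting). The preemphasis coefficient (0.95) and the raised-sine form of the cepstral weighting window are common choices assumed here; the paper does not specify them.

```python
import numpy as np

FS = 11025               # sampling rate after downsampling (Hz)
FRAME = int(0.030 * FS)  # 30 ms frame
HOP = int(0.015 * FS)    # 15 ms spacing -> 15 ms overlap
P = 10                   # LPC order
Q = 12                   # cepstral dimension
PRE = 0.95               # preemphasis coefficient (assumed, not given in the paper)

def levinson(r, order):
    """Levinson-Durbin recursion: autocorrelations r[0..order] -> LPC a_1..a_p."""
    a = np.zeros(order + 1)
    e = r[0] + 1e-12                     # small floor guards silent frames
    for i in range(1, order + 1):
        k = (r[i] - np.dot(a[1:i], r[i-1:0:-1])) / e
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i-1:0:-1]
        a, e = a_new, e * (1 - k * k)
    return a[1:]

def lpc_to_cepstrum(a, n_cep):
    """Standard recursion from LPC coefficients a_1..a_p to cepstra c_1..c_Q."""
    p = len(a)
    c = np.zeros(n_cep + 1)
    for m in range(1, n_cep + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c[1:]

def features(x):
    """Speech samples -> one weighted 12-dim cepstral vector per 30 ms frame."""
    x = np.append(x[0], x[1:] - PRE * x[:-1])          # first-order preemphasis
    win = np.hamming(FRAME)
    # Raised-sine cepstral weighting window (assumed form).
    lift = 1 + (Q / 2) * np.sin(np.pi * np.arange(1, Q + 1) / Q)
    vecs = []
    for start in range(0, len(x) - FRAME + 1, HOP):
        frame = x[start:start + FRAME] * win
        r = np.array([frame[:FRAME - l] @ frame[l:] for l in range(P + 1)])
        a = levinson(r, P)                             # 10 LPC coefficients
        vecs.append(lift * lpc_to_cepstrum(a, Q))      # 12 weighted cepstra
    return np.array(vecs)
```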
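And a sketch of the recognition phase: nearest-neighbor vector quantization against the codebook, followed by log-domain Viterbi scoring of the resulting observation sequence against each word model, picking the highest-scoring word. Function names and the `models` dictionary layout are illustrative assumptions.

```python
import numpy as np

def quantize(vecs, codebook):
    """Map each cepstral vector to its nearest codebook entry (VQ)."""
    d = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def viterbi_score(obs, A, B):
    """Log Viterbi score of an observation-index sequence against one word HMM.
    Per the models above, the path starts in state 1 and must end in state 5."""
    logA = np.log(A + 1e-12)             # small floor avoids log(0)
    logB = np.log(B + 1e-12)
    delta = np.full(A.shape[0], -np.inf)
    delta[0] = 0.0                        # always start in the first state
    delta = delta + logB[:, obs[0]]
    for o in obs[1:]:
        delta = np.max(delta[:, None] + logA, axis=0) + logB[:, o]
    return delta[-1]                      # score of ending in the last state

def recognize(vecs, codebook, models):
    """models: dict mapping each vocabulary word to its (A, B) pair (assumed layout)."""
    obs = quantize(vecs, codebook)
    return max(models, key=lambda w: viterbi_score(obs, *models[w]))
```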

RESULTS

The basic HMM system was tested using 18 normative speakers (nine male and nine female), only two of whom had also been used in training the system; these two provided additional speech samples for testing. Each speaker uttered every word in the 50-word vocabulary twice. Speech signals were sampled in a noise-free environment using a sound card sampling at 22050 Hz with 16 bits per sample. All speech wave files were edited to remove the silence before each word, and all files were downsampled to 11025 Hz before the LPC and cepstral vectors were calculated. The overall performance of the HMM system on clean speech was 94.5% accuracy. The system was then analyzed by gender of the speaker, by individual speaker (to determine whether the system was truly speaker independent), and by word (to see whether some words in the vocabulary were recognized better than others).

Tested by gender across all words, the system achieved a recognition accuracy of 95.0% for the female group and 93.9% for the male group. The female group did slightly better than the male group, but since the difference is small it can be concluded that the system was not gender biased.

Tested by speaker across all words, some speakers were recognized better than others. The highest speaker accuracy was 97.0% and the lowest was 87.0%; it was noticed that the speaker with 87.0% accuracy exhibited perceptible breath noise when talking. The two speakers who had also been used in training the system did no better and no worse than the other test speakers (some speakers did better than these two, others the same, and others worse), indicating that the system was speaker independent.

Tested by word across all speakers (i.e., the same word uttered by all speakers), some words were recognized very well by the system, while others were recognized with somewhat less accuracy. The highest accuracy associated with a word was 100.0% (for example, the word "contrast"); the lowest was 83.3% (for example, the word "help"). It was also noticed that words with fricative sounds such as "f" and "s" were recognized with less accuracy than other words.

DISCUSSION

The Hidden Markov Model (HMM) system was first implemented in a DOS environment, providing a user independent, isolated utterance, small vocabulary speech recognition system. The visual inspection workstation for which this speech recognition system is being developed is required to run under Microsoft Windows; the DOS implementation was therefore ported to run under Windows.

The Windows implementation added several features and offered several advantages not present in the DOS implementation. The first advantage is that Windows provides a software layer between the application and the sound card: calls are made not directly to the sound card but to the MCI routines, which then communicate with the card. This insulates the application from the sound card and allows any Windows-compatible sound card to be used. Another advantage is that Windows provides a user friendly interface which can easily be extended to assist persons with disabilities. This extended user interface allows a person with a hand motor disability to control the visual inspection workstation with voice commands. The word boundary detection routine is an important part of this extended interface: it automatically determines the beginning and ending of each word, detecting voiced speech with an energy function and unvoiced speech by computing the zero-crossing rate of the signal (a sketch of this approach appears below).

The hardware required to run the speech recognition system is a typically configured personal computer that includes a sound card and a microphone; the sound card should be able to sample at 11 kHz with a sample size of 16 bits. The only hardware item not included with a typically configured PC is a video capture board, which is used by the video portion of the visual inspection workstation but is not needed for the speech recognition part.

In addition to the normative codebook and model, a second codebook has been generated for a speaker dependent model for an individual considered to have a mild verbal impairment from cerebral palsy. This codebook was generated using eight repetitions of each of the 50 words and tested using two repetitions of each word; the model achieved a recognition accuracy of 94%. It is unclear at this time whether a codebook will be required for each individual or one codebook will suffice for each disability group. To determine this, data will be collected from persons from several disability groups where the disability is likely to cause verbal impairment, and also from multiple people within one disability group with varying degrees of impairment. A codebook for each disability group will first be generated and a model trained for that codebook; the model will then be tested for recognition accuracy within the group. If the recognition accuracy is comparable to that achieved with the normative data, individual models will not be developed. If, on the other hand, adequate accuracy is not achieved, individual models for each member of the disability group will be developed and tested.
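The word boundary detection described above might look like the following minimal sketch: frame energy flags voiced speech, while the zero-crossing count catches low-energy unvoiced (fricative) speech. The frame length and both thresholds are illustrative assumptions, not the values used in the actual routine.

```python
import numpy as np

def endpoints(x, fs=11025, frame_ms=10):
    """Return (start, end) sample indices of the word in x, or None if no speech.
    Frame length and thresholds are illustrative assumptions."""
    n = int(fs * frame_ms / 1000)
    frames = [x[i:i + n] for i in range(0, len(x) - n + 1, n)]
    energy = np.array([(f ** 2).sum() for f in frames])
    zcr = np.array([(np.diff(np.signbit(f).astype(int)) != 0).sum()
                    for f in frames])
    # A frame is speech if its energy is high (voiced) or its
    # zero-crossing rate is high (unvoiced); thresholds assumed.
    speech = (energy > 0.05 * energy.max()) | (zcr > 0.25 * n)
    idx = np.flatnonzero(speech)
    if idx.size == 0:
        return None
    return idx[0] * n, (idx[-1] + 1) * n
```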

REFERENCES

1. Baker, J.K., "Stochastic modeling for automatic speech understanding," in Speech Recognition, R. Reddy, ed., New York: Academic Press, 1975.

2. Rabiner, L.R., S.E. Levinson, and M.M. Sondhi, "On the use of Hidden Markov Models for speaker-independent recognition of isolated words from a medium-size vocabulary," AT&T Bell Laboratories Technical Journal, vol. 63, no. 4, April 1984, pp. 627-642.

3. Lee, Kai-Fu, Automatic Speech Recognition, Kluwer Academic Publishers, 1989.

4. Viterbi, A.J., "Error bounds for convolutional codes and an asymptotically optimal decoding algorithm," IEEE Transactions on Information Theory, IT-13, April 1967, pp. 260-269.

ACKNOWLEDGMENTS

The authors wish to acknowledge that funding for this research was obtained from the U.S. Department of Education, Office of Special Education and Rehabilitative Services, National Institute on Disability and Rehabilitation Research, as part of Project 5 of the Wichita Rehabilitation Engineering Research Center.

Hasan Abu-Zaina
Wichita State University
Department of Electrical Engineering
Wichita, KS 67260-0044
Phone: (316) 689-3415
Fax: (316) 689-3853
e-mail: abu@ee.twsu.edu