
Computer Recognition of Dynamic Gestures for Use in Augmentative Communication

Richard Foulds, Ph.D. and Beth Mineo Mollica, Ph.D.
Applied Science and Engineering Laboratories
University of Delaware/A.I. duPont Institute

Abstract

The use of gesture-based communication systems by individuals with significant expressive communication difficulties holds promise, but is limited by the need for interpretation by the receiver of the message. The development of a computer recognition technique that accommodates articulation variations and classifies gestures within streams of dynamic data is described.

Background

Gestural communication is believed to precede vocal communication in most individuals. Manual pointing, reaching and grasping movements, eyegaze, and facial expressions appear at early developmental levels (Kinsbourne, 1986).

In some instances, the use of gestures is not replaced by speech, but develops fully into an articulation of language. Among people who are profoundly deaf and are unable to use the auditory channel to reinforce the development of speech, the use of sign language (e.g., American Sign Language) is common.

The use of gesture as a mode of augmentative communication has been discussed in several contexts (Foulds, 1990; Lloyd and Karlan, 1984). The examples below describe two instances where gestural systems are used as alternatives to other augmentative communication techniques.

Example 1. An adolescent male with cerebral palsy has an extensive gesture vocabulary (150+ signs) comprised of conventional ASL signs, modified ASL signs, and gestures of his own invention. The modified signs represent compromises in his precision and range of movement. Handshapes are important in differentiating one sign from another.

His communicative attempts are rich in information. He often accompanies his gestures with facial expression, non-consonantal vocalization, or other body movements such as jumping or stiffening. He does not chain these signs and gestures according to formal syntactic rules (either ASL or English); however, he does "cluster" them when several gestures are needed to elaborate a complex relationship.

Those living with this young man and his classroom teacher understand most of his gestures, while his classmates, classroom aides, and those in the community are unable to interpret most of his gestural expression.

Example 2. A second adolescent male with a rare, progressive neurological disorder has developed his own gestural language with approximately 30 signs. These are invented gestures and have little similarity to ASL signs. The articulation of the signs is compromised by his motoric disability, which limits the accuracy, range, and speed of his movements. Due to physical limitations, only two handshapes (open and closed) are used.

This gestural vocabulary has been promoted by classroom teachers and parents and is used as a primary mode of communication. Since vision is compromised by cataracts, he receives only auditory language feedback, and must rely almost entirely on his proprioceptive feedback for gesture production and refinement.

He uses his gestures individually or in simple combinations, but does not produce syntactically correct sentences. His vocabulary is continuing to expand with the invention of entirely new signs and the combination of existing signs into new signs.

As in the first example, these signs are intelligible only to family members and the classroom teacher.

Statement of the Problem

The two examples introduced above offer both a justification for the use of gestural communication in augmentative communication and an illustration of its primary disadvantage. While both individuals demonstrate reasonable capabilities, their use of their gestural vocabulary and the expansion of that vocabulary are very likely bounded by the exceedingly limited environment in which the gestures are understood. The general problem addressed by this paper is the expansion of that language environment by means of computer recognition of such gestures and their translation into synthesized speech.

Further consideration of the examples defines parameters that are essential to a recognition system. Paramount among these is the broad range of variation that exists across the signs used by the individuals. The two sets of signs described in the examples have few gestures in common.

In contrast to sign languages used by individuals who are deaf, gestural vocabularies used in augmentative communication systems are often idiosyncratic, having been developed in isolation and in accord with the physical abilities of the user. Roy and his colleagues (1994) have shown that many non-speaking individuals are motorically capable of making gestures that are repeatable and can be mapped to words or concepts. Due to the nature of the underlying physical disability, these gestures may not follow any standardized form nor be easily recognized as iconic representations.

A system capable of recognizing gestures must be capable of learning such idiosyncratic gestures and cannot be based on a general model of gesture production.

Also evident in the physical production of gestures by both individuals in the examples is the variation in articulation of the same gesture due to the physical disability. The same intended gesture may be altered in accuracy of formation and placement, and may vary in duration. These variations must be accounted for in a recognition system.

Additionally, unlike communication systems in which messages are delimited by keystrokes, the contact of a pen with a surface, or even speech, gestural messages are embedded in a continuous flow of movement of the arms and hands. A recognition system must identify a meaningful gesture among other, non-gesture movements, and separate one gesture from the next.

Approach

Of concern in this paper is the recognition of the arm and head movements which are dynamic in the time domain. Handshape recognition, which is important in an augmentative communication system, is considerably more postural and is addressed elsewhere (Messing, Erenshteyn, Foulds, Galuska, & Stern, 1994).

The approach presented here advances the recognition of arm and head movements in several areas, and overcomes many of the limitations of recognition approaches previously discussed in the literature. As with the previous work, this recognition approach is based on a pattern classification technique that is capable of learning an individual's "style" of movement. Harwin (1991) employed Hidden Markov Models, which are commonly used in speech recognition. Pericos and Jackson (1994) adopted a template matching approach with Dynamic Time Warping, also used extensively in speech recognition. Cairns and Newell (1994) have compared both Dynamic Programming and Hidden Markov Models. Roy et al. (1994) employed back-propagation neural networks in their recognition of the trajectory of the arm.

In general, all of the pattern classifiers share common approaches. The positional data, a stream of three-dimensional coordinates varying in time, defines the limb or head trajectory. All of the classifiers use experimental data to derive templates or weights in an attempt to "learn" individual gestures. Subsequent data is compared to what has previously been "learned," and best matches are established.

The classification techniques also share common problems. None of the classifiers has been able to deal with the variation in production duration that is common among individuals with significant physical impairment, and none has been able to identify gestures within a continuous flow of movement.

Thus, a gesture that is produced more slowly or more rapidly may appear to the classifiers as very different from the training gestures, even though a human observer may identify these as having the same meaning. Similarly, a human who understands the gestural code may observe a movement and identify the production of meaningful gestures, while a computer-based classifier will see only a continuous movement and will not identify the beginnings and endings of the individual gestures.

As a result, the prior work requires that the stream of gestural data be segmented artificially (by the researchers) for training and test procedures. These projects also seriously limit the variation in duration of a gesture.

The work presented in this paper builds upon the handwriting recognition research of Morasso and colleagues (1993). The recognition of cursive handwriting represents a two-dimensional problem similar to the recognition of arm and head movements. The individual letters may vary in duration and size. They are also interconnected within words and are essentially a continuous stream of data points. Individual letters, like individual gestures, are not produced in isolation.

The improvements to the recognition of dynamic gestures include the segmentation of the trajectory according to biomechanical measures. This does not identify the boundaries of individual gestures, but segments the data stream into a series of strokes that are defined by computing the velocity profile of the X, Y, and Z components of the movement. These strokes can be determined readily and reliably.
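
As a rough sketch of this kind of velocity-based segmentation (the implementation details are not given in this paper, so the function below, its name, and its threshold are illustrative assumptions), a stream of three-dimensional samples can be split at local minima of tangential speed:

    import numpy as np

    def segment_strokes(positions, dt, min_speed_fraction=0.15):
        """Split a stream of 3-D positions into strokes at speed minima.

        positions: (N, 3) array of X, Y, Z samples taken every dt seconds.
        Returns a list of (start, end) sample indices, one pair per stroke.
        """
        velocity = np.gradient(positions, dt, axis=0)  # per-axis velocity
        speed = np.linalg.norm(velocity, axis=1)       # tangential speed
        threshold = min_speed_fraction * speed.max()

        # Stroke boundaries are local speed minima below the threshold,
        # i.e., points where the limb momentarily slows between submovements.
        boundaries = [0]
        for i in range(1, len(speed) - 1):
            if (speed[i] < threshold and speed[i] <= speed[i - 1]
                    and speed[i] <= speed[i + 1]):
                boundaries.append(i)
        if boundaries[-1] != len(speed) - 1:
            boundaries.append(len(speed) - 1)

        return [(boundaries[k], boundaries[k + 1])
                for k in range(len(boundaries) - 1)]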

A gesture, as in Morasso's cursive letters, can be redefined as a combination of strokes. In order to overcome the temporal and size variations, each stroke (which is of course subject to timing and sizing variation) is further processed into a multi-feature vector. This vector characterizes the stroke according to its shape, orientation, and length within the three-dimensional space. This allows the length and orientation information to be separated from the shape. (Length may be used in further classification of large vs. small gestures, and orientation may be used to separate the same strokes made in different positions.)
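
A minimal sketch of such a multi-feature vector, assuming one simple choice of features (the paper does not specify its shape descriptor, so resampled, size-normalized coordinates stand in for it here):

    import numpy as np

    def stroke_features(points, n_shape_samples=8):
        """Characterize one stroke by shape, orientation, and length.

        points: (M, 3) array of the positions belonging to a single stroke.
        Returns (shape_vector, orientation_unit_vector, arc_length).
        """
        # Length: total arc length along the stroke.
        steps = np.diff(points, axis=0)
        arc_length = np.linalg.norm(steps, axis=1).sum()

        # Orientation: unit vector from the stroke's start to its end.
        chord = points[-1] - points[0]
        orientation = chord / (np.linalg.norm(chord) + 1e-9)

        # Shape: resample, center, and scale to unit length, so the same
        # shape made larger, smaller, or slower yields the same vector.
        idx = np.linspace(0, len(points) - 1, n_shape_samples).astype(int)
        shape = points[idx] - points[idx].mean(axis=0)
        shape = shape / (arc_length + 1e-9)
        return shape.ravel(), orientation, arc_length

Resampling every stroke to a fixed number of points is one way to give each stroke a vector of the same dimensionality, which the networks described below require.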

Following the biomechanical segmentation, a committee of unconnected, back-propagation-trained neural networks examines sequences of vectors of successive strokes. Since gestures may be comprised of different numbers of strokes, the parallel neural networks each process different numbers of strokes.
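
The committee structure can be sketched as follows (a hedged illustration, not the authors' implementation: the class, its names, and the use of generic callables in place of the trained back-propagation networks are all assumptions):

    import numpy as np

    class StrokeCommittee:
        """Parallel classifiers over windows of 1..K successive strokes."""

        def __init__(self, classifiers):
            # classifiers[k] expects the concatenation of k + 1 stroke
            # vectors and returns a (gesture_label, confidence) pair.
            self.classifiers = classifiers

        def score(self, stroke_vectors):
            """Score the gesture hypothesis ending at the newest stroke.

            stroke_vectors: list of 1-D feature vectors, oldest first.
            Returns one (gesture_label, confidence) per committee member
            that has enough strokes to fill its window.
            """
            results = []
            for k, classify in enumerate(self.classifiers):
                window = stroke_vectors[-(k + 1):]  # the last k + 1 strokes
                if len(window) == k + 1:
                    results.append(classify(np.concatenate(window)))
            return results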

As an illustration, two simple gestures are compared. The first is a circle made with the hand. The second is an up/down movement of the hand. The raw data for each gesture will depend upon its duration, size and orientation.

It is highly likely that subsequent articulations of these gestures could not accurately reproduce their precise trajectories or timing, and would produce very different sets of raw data.

However, when the data are subjected to biomechanical segmentation based upon the velocity profiles, the circle is found to be comprised of four strokes (four arcs that are each in different orientations), and the up/down movement is represented by two strokes (two straight lines each in different orientations). Each stroke is represented by a multi-dimensional vector that accounts for its shape, orientation, and length.

The neural network is trained on the multi-dimensional vectors of known gestures. In the example, there would be a committee of four neural networks that are trained on different numbers of strokes. All vectors are passed to the committee. One network sees the vector of only the current stroke under examination, the second sees that vector as well as the vector of the previous stroke, the third examines the current vector and the two previous vectors, and the fourth sees the current vector and the three previous vectors.

Thus, the two-stroke up/down gesture will be identified by the two-stroke neural network, which will report a high value for recognition. The other networks, which are looking at different numbers of strokes, will report lower values. Similarly, the circle, with its four strokes, will be identified by the four-stroke network.
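
Under the same assumptions as the sketch above, selecting the winner then amounts to taking the most confident committee member; a confidence floor (the threshold value here is an invented placeholder) lets the system reject movement that does not end a known gesture:

    def recognize(committee, stroke_vectors, min_confidence=0.5):
        """Return the best (gesture_label, confidence) over the committee,
        or None when no member is confident, i.e., when the current
        movement does not appear to complete a known gesture."""
        scores = committee.score(stroke_vectors)
        if not scores:
            return None
        best = max(scores, key=lambda pair: pair[1])
        return best if best[1] >= min_confidence else None

Running this after every detected stroke is one way the gestures could be picked out of a continuous stream: most strokes yield no confident match and are passed over as non-gesture movement.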

Implications

The implications of this work are significant: it minimizes the effect of production variations in timing, accuracy, and orientation, and it makes possible the identification of gestures within a data stream, reducing the need for artificial segmentation.

Discussion

The technique presented in this paper offers to improve the ability of computers to classify or recognize the arm and head movements that are major components of meaningful gestures. Coupled with ongoing handshape classification, it will be possible to demonstrate a gesture recognition system that can process dynamic movements and classify them into known gestures. Such gestures can then be mapped onto language units (words, phrases, concepts, etc.) that can be spoken through a speech synthesizer.

The ability to transform gestures, which are meaningful but unintelligible to a general audience, into speech will allow users to more fully exploit their gesture capabilities and will likely expand their gesture vocabulary and enhance their communication interactions.

References

Cairns, A., & Newell, A. (1994). Towards gesture recognition for the physically impaired. Proceedings of the RESNA 1994 Conference (pp. 414-416). Arlington, VA: RESNA.

Foulds, R. (1990). Look, listen, and try not to be discrete. In B. Mineo (Ed.), Augmentative and Alternative Communication in the Next Decade (pp. 89-92). Wilmington, DE: Applied Science and Engineering Laboratories.

Harwin, W. (1991). Computer recognition of the unconstrained and intentional head gestures of physically disabled people. Doctoral dissertation, Cambridge University, Cambridge, UK.

Kinsbourne, M. (1986). Brain organization underlying orientation and gestures: Normal and pathological cases. In J. Nespoulous, P. Perron, & A. Lecours (Eds.), The Biological Foundations of Gesture (pp. 49-64). Hillsdale, NJ: Lawrence Erlbaum.

Lloyd, L. L., & Karlan, G. (1984). Nonspeech communication systems and symbols: Where have we been and where are we going? Journal of Mental Deficiency Research, 38, 2-30.

Messing, L., Erenshteyn, R., Foulds, R., Galuska, S., & Stern, G. (1994). American Sign Language computer recognition. Proceedings of ISAAC '94 (pp. 289-291). Toronto, Canada: ISAAC.

Morasso, P., Barberis, L., Pagliano, S., & Vergano, D. (1993). Recognition experiments of cursive dynamic handwriting with self-organizing networks. Pattern Recognition, 26, 451-460.

Pericos, C., & Jackson, R. (1994). A head gesture recognition system for computer access. Proceedings of the RESNA 1994 Conference (pp. 94-96). Arlington, VA: RESNA.

Roy, D., Panayi, M., Foulds, R., Erenshteyn, R., Harwin, W., & Fawcus, R. (1994). The enhancement of interaction for people with severe speech and physical impairment through computer recognition of gesture and manipulation. Presence, 3, 227-235.

Acknowledgments

This research is supported by the Rehabilitation Engineering Research Center on Augmentative Communication, Grant Number H133E30010 from the National Institute on Disability and Rehabilitation Research of the U.S. Department of Education, and the Nemours Foundation.

Author address

Richard Foulds and/or Beth Mineo Mollica

Applied Science and Engineering Laboratories

University of Delaware/A.I. duPont Institute

P.O. Box 269

Wilmington, DE 19899 USA

Email: foulds@asel.udel.edu / mineo@asel.udel.edu