
COMPUTER ANIMATED FINGERSPELLING FOR ASSISTIVE TECHNOLOGY

Clarke Steinback
Computer Science Board
University of California, Santa Cruz
Santa Cruz, California 95064
Internet: ranger@cse.ucsc.edu

Suresh Lodha
Computer Science Board
University of California, Santa Cruz
Santa Cruz, California 95064
Internet: lodha@cse.ucsc.edu

Web Posted on: December 12, 1997


0. ABSTRACT

Fingerspelling is a component of American Sign Language (ASL) in which the movements of a single hand spell out English text. Computerized fingerspelling, that is, converting English text into images of the movements of a single hand, is an important task for assistive technology. This paper provides a first step toward this task by converting English letters into images of the movements of a single hand; a word, and subsequently English text, can then be converted into a series of images corresponding to fingerspelling. The most difficult and challenging part is providing a non-distracting, continuous transition from one letter to another. This transition information is captured for each pair of letters with a CyberGlove and stored in the CARPI database. The system analyzes the text to extract letter pairs and their corresponding transition information from the CARPI database; the animation process then uses this transition information to produce the animation sequence for the text.


1. INTRODUCTION

Fingerspelling is one of the communication tools used by deaf and hearing impaired individuals to communicate with each other and with hearing people. A language such as American Sign Language (ASL) is a complicated visual system involving the hands, arms, torso, head, and facial expressions, with both static and dynamic aspects to the signs. The fingerspelling system within ASL uses only a single hand, held in front of the torso, to form gestures representing the letters of the English alphabet. Focusing on fingerspelling for use in assistive technology simplifies the problem to that of producing correct gestural movements of a single hand. Although the individual fingerspelling signs, consisting of hand positioning and motion components, are not in themselves excessively complicated, two aspects are key to making computer animated fingerspelling successful for assistive technology -- realistic human hand rendering and realistic animation of the transitions between letter pairs.

Information for displaying a single gesture can be obtained by several means, including photometric measurement, manual measurement, and a data glove. However, letters occur in streams, not as isolated elements, so any animation of fingerspelling must display the transitions between them effectively. Viewers notice when an animation of a human figure does not behave correctly, whether the movement is walking or running, a facial expression, or a hand gesture. An animation of fingerspelling must therefore have realistic signs and realistic transitions between signs; otherwise the animation could be perceived as flawed and may not be of much practical value. Realistic animation of human hands for assistive technology thus becomes a more complicated endeavor than simply placing a rendered hand at the end of an animated arm. This paper attempts to achieve this realism by capturing and using transition information.


2. BACKGROUND AND RELATED WORK

Two major approaches to computer assisted fingerspelling have been demonstrated: one uses video clip technology, the other computer generated animation.

The video clip approach uses digital recordings of actual signers to produce realistic images and motion of fingerspelling and sign phrases (Haas 92). The system requires a large storage capacity and provides playback only of pre-recorded sequences. For translation devices, video clips would need to be linked together to produce a flexible dynamic display; such a system should therefore use a consistent signer appearance and background for all phrases to minimize visual anomalies when linking images.

Computer generated animation allows new phrases to be assembled out of the words and letters the system knows how to generate (Magnenat-Thalmann 90, Ohki 94). In such systems, the computer renders the signs as needed, providing the flexibility required for translation. Holden and Roy (Holden 92) developed a system to translate English text into Australian Sign Language as a teaching aid; running on a Sun SPARC workstation, it used a manual editing system to create the hand poses for each letter and produced 2D animation. Lee and Kunii (Lee 92, Lee 93) used both a manual notation system and information captured with a data glove to produce 3D fingerspelling animation on an SGI Personal Iris as part of their work on synthetic actors. Ohki et al. (Ohki 94) captured positional information with a data glove in developing an information kiosk prototype that used an HP 9000/720 to display 3D animated signing of a limited set of words and phrases in Japanese Sign Language. None of these animated systems ran on portable hardware, which may be more convenient for assistive technology.

All of these animated systems used linear interpolation between signs, which can lead to unnatural-looking hand motion. To avoid unrealistic motion, features such as collision detection (Moore 88), motion constraints (Witkin 88), goal-directed animation (Bruderlin 89), and knowledge-based systems (Rijpkema 91) have been used for other animation tasks. Lee and Kunii (Lee 93) added constraints to their system to try to prevent unrealistic behavior in the animation. For real-time assistive technology, however, these constraints may slow the system down and may not be necessary. Since a sign pair can be considered a repetitive motion, a controller can be developed that avoids the need for constraints, similar to the human athlete animation of Hodgins (Hodgins 95).
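To make this limitation concrete, linear interpolation between two signs blends each joint rotation independently. The sketch below is our own illustration rather than code from any of the cited systems; the 24-value pose anticipates the hand model of Section 3.1:

    #include <array>
    #include <cstddef>

    // A pose: one rotation value per degree of freedom (24 in the hand
    // model of Section 3.1).
    typedef std::array<float, 24> Pose;

    // Linear interpolation between two sign poses: each joint rotation is
    // blended independently as t runs from 0 to 1. Because no joint "knows"
    // about any other, intermediate frames can pass through configurations
    // a real hand never adopts, which is why the motion can look unnatural.
    Pose lerp(const Pose& start, const Pose& end, float t) {
        Pose out;
        for (std::size_t i = 0; i < out.size(); ++i)
            out[i] = (1.0f - t) * start[i] + t * end[i];
        return out;
    }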


3. COMPUTERIZED FINGERSPELLING SYSTEM

We now present the highlights of our computerized fingerspelling system for assistive technology.

Since English text contains only a fixed number of letters, and thus a fixed number of letter combinations, ASL fingerspelling can be viewed as repetitive motion directed by a script (the text). To exploit this, we developed a hand model that allowed us to record the position of the hand and fingers with a CyberGlove. A library of poses and their appropriate intermediate poses (much like a set of controllers) was created. A display program was then developed that reads the text in letter pairs and uses the library of intermediate positions to produce the corresponding animation. This approach lets the system's processor be used for animation rather than for the calculation of constraints.

3.1 HAND MODEL

Although the human hand consists of bone, muscle, and deformable skin, and is subject to motion constraints, the current experiment used an object oriented model of stylized bone having 20 joints and 24 degrees of freedom (DOF), with no constraints. To obtain the appropriate proportions and fixed rotation components for the bones, physical measurements of actual human hands were used, along with measurements obtained from the hand models in Wilson and Wilson (Wilson 78).

The model allows 4 DOF for each finger, 5 DOF for the thumb, and 3 DOF for the wrist. The two most distal joints of each digit require only a single positioning value, since these joints only flex. The proximal joint of each finger requires two positional values, and the proximal thumb joint requires three. Additionally, the wrist requires three values for changes in rotation, flex, and yaw. The object oriented model allows messages to be sent to the hand object to position the selected component according to the specified rotations. The hand model is also responsible for rendering the bones as cylinders and the joints as spheres.
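One way to lay out these 24 DOF in code is sketched below. This is our reading of the description above; the type and field names are illustrative rather than taken from the implementation:

    // Illustrative layout of the hand model's 24 degrees of freedom.
    struct Digit {
        float proximal[2];  // proximal joint: flex and abduction (2 DOF)
        float middle;       // middle joint: flex only (1 DOF)
        float distal;       // distal joint: flex only (1 DOF)
    };

    struct Thumb {
        float proximal[3];  // proximal thumb joint: 3 DOF
        float middle;       // flex only (1 DOF)
        float distal;       // flex only (1 DOF)
    };

    struct HandPose {
        float wrist[3];     // rotation, flex, and yaw
        Thumb thumb;        // 5 DOF
        Digit fingers[4];   // index through little finger, 4 DOF each
    };                      // 3 + 5 + 4*4 = 24 DOF in total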

3.2 DATA CAPTURE

A CyberGlove with a position tracker, connected to an SGI Indigo 2 Extreme, was used in an interactive data capture system for recording finger, thumb, and wrist positions. The system captures all 24 DOF and provides visual feedback on the result.
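A capture session for one transition might loop as sketched below. Here readGlove is a hypothetical stand-in for the CyberGlove sampling call; the actual capture code is not reproduced in this paper:

    #include <vector>

    typedef std::vector<float> Frame;  // 24 rotations, one per DOF

    // Hypothetical stand-in for sampling the CyberGlove and position tracker.
    Frame readGlove();

    // Record a transition between two signs as a sequence of frames,
    // taking one 24-DOF sample per time step.
    std::vector<Frame> recordTransition(int nSamples) {
        std::vector<Frame> frames;
        for (int i = 0; i < nSamples; ++i)
            frames.push_back(readGlove());
        return frames;
    }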

The mechanism that stores the transitional position and movement information for each pair of signed letters is viewed as a controller. Each pair of letters has a starting state and an ending state, as well as transitional information specific to the pairing. The individual controllers are collected into a larger database structure, called the Computer Animation Representational Position Index (CARPI) in this paper. The database structure is simple: an index based on the letter pair, and an array of rotations for each hand component at each time step of the transition.
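In outline, this structure might be declared as follows (an illustrative sketch under our assumptions, not the actual implementation):

    #include <map>
    #include <string>
    #include <vector>

    // One time step of a transition: a rotation value for each of the
    // 24 hand components (the Frame type from the capture sketch above).
    typedef std::vector<float> Frame;

    // A controller: the captured in-between frames for one letter pair,
    // implicitly including its starting and ending states.
    struct Controller {
        std::vector<Frame> frames;
    };

    // CARPI, indexed by the two-letter pair (e.g. "AB", "E_", "_S",
    // where '_' is the neutral/pause position).
    typedef std::map<std::string, Controller> Carpi;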

3.3 INPUT ENVIRONMENT

The input process starts by parsing the text. The first letter pair is formed from the neutral hand position and the first letter of the text. The input system uses this pair to look up the initial state and transitional information in CARPI. Once the appropriate controller has been located, the animation for that pair can be performed.

After the first pair of letters has been parsed, the input process reads the next letter in the text stream. The second letter of the previous pair, together with the new letter, forms the next transitional letter pair. Again, the letter pair is used to look up the appropriate controller, and the animation is performed with this data. The input process continues reading the next letter and looking up the letter pair in CARPI until the end of the input stream. The last letter is paired with the neutral position to create a final transition pair.
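This pairing scheme can be summarized in a few lines (an illustrative sketch; '_' denotes the neutral position, as in Section 3.5):

    #include <string>
    #include <vector>

    // Convert input text into the sequence of CARPI lookup keys described
    // above: neutral ('_') is paired with the first letter, each subsequent
    // letter with its predecessor, and the last letter with neutral again.
    std::vector<std::string> letterPairs(const std::string& text) {
        std::vector<std::string> pairs;
        char prev = '_';                         // start from the neutral pose
        for (char c : text) {
            pairs.push_back(std::string{prev, c});
            prev = c;
        }
        pairs.push_back(std::string{prev, '_'}); // return to neutral
        return pairs;
    }

For example, letterPairs("SAB") yields the keys _S, SA, AB, and B_.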

3.4 ANIMATION PROCESS

The animation process uses a program to display the hand model and animate it based on the controller information in the CARPI database. The program ran on Windows NT 3.51 using OpenGL, Borland Delphi 2.0, and Microsoft Visual C++ 2.0. The process uses the controller information to change the geometry of the model and then renders the hand for viewing. The timing of the steps is controlled by the speed setting, which thus determines how the controller is stepped through.
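Schematically, the display loop steps through a controller's frames at a rate set by the speed setting. The sketch below reuses the Frame and Controller types from Section 3.2; applyFrame and renderHand are hypothetical stand-ins for the hand object's positioning messages and the OpenGL drawing code:

    // Hypothetical stand-ins for the hand model interface.
    void applyFrame(const Frame& f);  // send rotations to the hand's joints
    void renderHand();                // draw bones as cylinders, joints as spheres

    // Play one letter-pair transition. The speed setting determines how
    // many captured frames are consumed per rendered image: larger steps
    // give a faster but coarser animation.
    void playTransition(const Controller& ctrl, double framesPerImage) {
        for (double cursor = 0.0; cursor < ctrl.frames.size();
             cursor += framesPerImage) {
            const Frame& f = ctrl.frames[static_cast<std::size_t>(cursor)];
            applyFrame(f);
            renderHand();
        }
    }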

3.5 EXPERIMENTAL RESULTS

The data capture system allowed capture of all 24 DOF available in the hand model for letter pairs. Using this system, the hand geometry for the letters A through Y, and for the letter pairs AB, AE, E_, SA, and _S (where '_' represents a pause), was captured and stored in CARPI.

The animation used the captured data stored in CARPI and was displayed on a laptop (an Intel 486 DX2 66 with 20 MB of RAM running Windows NT 3.51). Each letter pair was viewed with no inbetweening, with linear interpolation, and with the captured transitional data. As the complexity of the inbetweening data increased, the display system's performance decreased.


4. CONCLUSION

The hand model was found to be suitable for this demonstration of the approach. The controllers worked within the animation for the limited letter list and demonstrated that some sign pair controllers produce more realistic hand motion than the corresponding linear interpolation between the signs. Overall, the experiment showed the validity of using controllers to produce more realistic animation of fingerspelling sign pairs than the linear interpolation method.

Overall, the system showed that recognizable fingerspelling images can be produced from free-form text input using inexpensive laptop computers. The glove capture system made capturing the hand geometries easier and less time consuming than manual measurement, even allowing for the glove calibration time required for each user. The display program on the laptop produced recognizable fingerspelling images, though with either linear interpolation or transitional data from the database used for inbetweening, the display lagged noticeably behind the pace of actual fingerspelling.


5. FUTURE WORK

Further work is needed to determine the level of hand image quality and the level of motion realism in the successive hand images sufficient both for understanding by the user and for timely display of text-to-fingerspelling. User interface issues, such as the hand image presentation rate, the hand image size, and the background images needed to accommodate understanding by different users, should also be addressed. Furthermore, since there is variation in the captured data between instances of the same letter sign, a method needs to be developed that presents the ending sign of one pair and the beginning sign of the next such that there is no discernible discontinuity for the user. Finally, a user evaluation system should be developed to determine minimum and desired levels of the informational components (quality, motion, speed, and display size) and desired levels of the non-informational components (background image, hand color, and lighting), again for both understanding by the user and timely display of text-to-fingerspelling.


ACKNOWLEDGMENTS

We thank Dr. Alex Pang for his generous support in allowing us to use his graphics laboratory and virtual reality equipment including CyberGlove.


REFERENCES

Bruderlin, A. and T.W. Calvert. "Goal-directed, Dynamic Animation of Human Walking." SIGGRAPH Conference Proceedings, 23(3):233-242, August 1989.

Haas, C., and S.X. Wei. "Stanford American Sign Language Videodisc Project." Proceedings of the Johns Hopkins National Search for Computing Applications to Assist Persons with Disabilities (1-5 Feb. 1992). Los Alamitos, CA, USA: IEEE Computer Society Press, 1992. pp. 41-44.

Hodgins, J. "Animating Human Athletes." SIGGRAPH Conference Proceedings, 29(4):71-78, August 1995.

Holden, E.J. and G.G. Roy. "The Graphical Translation of English Text into Signed English in the Hand Sign Translator System." Eurographics. 11(3):C357-C366. 1992.

Lee, J. and T.L. Kunii. "Hand Motion Coding System for Algorithm Recognition and Generation." Proceedings of Computer Animations '92. Editors: N. Magnenat-Thalmann and D. Thalmann, Springer- Verlag, Tokyo, Japan, 1992.

Lee, J. and T.L. Kunii. "Models and Techniques in Computer Animations." Proceedings of Computer Animations '93. Springer-Verlag, Editors: N. Magnenat-Thalmann and D. Thalmann, Tokyo, Japan, 1993.

Magnenat-Thalmann, N. and D. Thalmann. Synthetic Actors in Computer-Generated 3D Films. Springer-Verlag, New York, 1990.

Moore, M. and J. Wilhelms. "Collision Detection and Response for Computer Animation." SIGGRAPH Conference Proceedings, 22(4):289-298, August 1988.

Ohki, M., H. Sagawa, T. Sakiyama, E. Oohira, H. Ikeda and H. Fujisawa. "Pattern Recognition and Synthesis of Sign Language Translation System." ASSETS '94: The First Annual ACM Conference on Assistive Technologies. 1(1):1-8. 1994.

Rijpkema, H., and M. Girard. "Computer Animation of Knowledge-Based Human Grasping." SIGGRAPH Conference Proceedings, 25(4):339-348, July 1991.

Wilson, D.B. and W.J. Wilson. Human Anatomy, Oxford University Press, New York, 1978.

Witkin, A. and M. Kass. "Spacetime Constraints." SIGGRAPH Conference Proceedings, 22(4):159-168, August 1988.