
Web Posted on: August 4, 1998

Video Based Gesture Recognition for Augmentative Communication

Louise Clarke
Peter Harper
Richard B. Reilly

Dept. of Electronic and Electrical Engineering
University College Dublin
Belfield, Dublin 4
Rep. of Ireland
Tel : +353-1-7061909
Fax : +353-1-2830921
E-mail : richard.reilly@ucd.ie

1. Summary

Augmentative and alternative communication (AAC) systems can be described as methods which endeavour to provide enhanced communication possibilities. AAC aims to provide access to technology for those without the fine motor control necessary to drive the "standard" interfaces such as keyboard and mouse. A non-contact video based method of AAC has been developed. This includes the automatic location of important physical features on the face of the user, such as the eyes and mouth. These features can then be tracked from frame to frame and transformed into 2-D co-ordinates and echoed to the user application screen as the mouse cursor.


2. Introduction

When assessing or prescribing an AAC system, clinicians have understood for some time that best practice is not a single device but a combination of devices and solutions [1]. For such a multimodal approach to function, the AAC system must be customisable to the individual, which requires each element of the system to be highly configurable to meet the individual's needs. It is generally accepted that users can "adapt" their response to suit the interface device, but a more appropriate solution must include the ability for the system to adapt to the user.

A number of motion and movement tracking devices are available, some placing the sensing, reflective or transmitting element at an anatomical site [1][2]. Some systems employ transmitters worn by the user [3], while others use reflective markers to echo the sensing signal back to a transceiver module [4]. Others are electro-magnetically based [5]; some systems make use of eye movements, tracking the centre of the iris [6], while in others CCD image sequences are analysed and the contrast of the hair and face employed to recognise face orientation [7]. The vast majority of these methods require the user to wear some accessory. The LAMP system, developed during the TIDE Bridge phase, is a complete non-contact movement tracking system [8].


3. Video-Based Gesture Tracking

Based on the experiences of the LAMP project, a video based movement analysis system has been developed. The system allows tracking of the head of the user through the use of a standard low cost CCD video camcorder (320x240 pixels in 24-bit colour) located on top of a PC monitor, together with a Creative Labs Video Blaster RT-300 capture card. The analysis procedure automatically locates important features on the face of the user, such as the eyes and mouth. These features can then be tracked from frame to frame, transformed into 2-D co-ordinates and echoed to the user application screen as the mouse cursor.


3.1 Face Location

One very distinct feature of the human face is skin colour, which differentiates the face of the user from the background environment. According to Yang [9], "skin colour is a perceptual phenomenon, not a physical one", in that it relates to the spectral characteristics of electromagnetic radiation in the visible wavelengths striking the retina. Detecting skin colour has three main associated problems: skin colour depends on lighting conditions (natural and artificial), different cameras produce different colour values under similar lighting conditions, and human skin colour differs greatly from person to person. Considerable research has been carried out to develop a skin colour model as a method of locating features for face tracking [10],[11].

In an AAC environment, the system must be able to cope with a variety of skin types, lighting conditions and video hardware. As a result, an automatic skin colour map was developed. As part of this, a blink detection algorithm was designed, in which one image is captured with the user looking at the centre of the computer screen and another is captured while the user blinks [12]. The RGB (Red-Green-Blue) colour distance between the two images is then calculated on a pixel-by-pixel basis.

By setting a threshold value for the colour distance, areas of significant movement between image frames can be located. As these areas contain the eyes, a region in the same locality can be employed for statistical analysis of skin colour. This area is examined pixel by pixel, to extract the user's specific skin colour information.
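The colour-distance and sampling steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Euclidean distance measure and the threshold value are all assumptions.

```python
def rgb_distance(p, q):
    """Euclidean distance between two (R, G, B) pixels."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def movement_mask(frame_open, frame_blink, threshold):
    """Flag pixels whose colour changed significantly between the
    eyes-open and blink frames; these areas contain the eyes."""
    return [[rgb_distance(p, q) > threshold
             for p, q in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_open, frame_blink)]

def mean_colour(pixels):
    """Average RGB over a sampled region near the located eyes --
    the simplest statistic for a user-specific skin colour model."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))
```

The paper's statistical skin model is built from a region in the same locality as the detected eyes; `mean_colour` shows only the simplest such statistic.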


3.2 Feature Location

The typical physical facial features used in movement tracking include the centres of the pupils, the corners of the mouth, the nostrils and the tip of the nose. In this system, the corners of the eyes and mouth were tracked, as they proved the most reliable features: they form sharp corners and are easy to locate. The method of skin colour analysis was successful in locating the face region, and a similar procedure was adopted to locate the features within the face. It was assumed that the user is initially in a near frontal position and that both eyes are therefore visible. Knowledge of feature locations is used to restrict the search area for the eyes to the upper half of the face region and that for the mouth to the lower half, Figure 1. Anthropometric measures can also be used to reduce the search window [13].

A particular pixel must be selected on each feature to enable tracking in subsequent images. Initial attempts were made to locate and track the outer corner of each eye; however, this proved problematic when the user turned his/her head to the right or left, as the outer edge of the eye disappears or is lost against the image background. The inner corner of each eye was therefore selected, as it remains in view even at maximum rotation. The shape of the mouth is similar to that of an eye, and the same algorithm is applied to locate the rightmost and leftmost points of the lips. White crosses are placed over the four points of interest, as can be seen in Figure 2.
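The search-window restriction described above can be sketched with a hypothetical helper (the box layout and names are assumptions, not the paper's code):

```python
def feature_search_regions(face_box):
    """Split a face bounding box (x, y, w, h) into the eye search
    area (upper half) and the mouth search area (lower half)."""
    x, y, w, h = face_box
    eyes = (x, y, w, h // 2)
    mouth = (x, y + h // 2, w, h - h // 2)
    return eyes, mouth
```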


3.3 Feature Tracking in Real-Time

Knowledge of the previous feature location and the velocity of the movement is used to reduce the size of the search area for feature points in subsequent images. Motion interpolation is used to calculate the expected new position, and the search area is centred about this calculated position. The width and height of the search area are related to the velocity of the point; thus, for rapid movement from left to right, the search box is wider than it is tall. This allows fast location and tracking of all feature points. Two levels of error checking were also implemented, to catch points straying outside the face or jumping from the eye to the eyebrow. Tracking failure is assumed if one of the points is found to be moving in a significantly different direction from the other three. Periodically, the position of the face as a whole is checked; if any of the tracked points is found to lie outside the face, tracking failure is also assumed.
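The motion-interpolated search window and the direction-consistency check can be sketched as follows. All names, the box-sizing constants and the sign-of-dot-product consistency test are illustrative assumptions; the paper does not give its exact formulation.

```python
def predict_search_box(prev_pos, velocity, base=8, gain=2):
    """Centre a search box on the motion-extrapolated position; its
    width/height grow with the horizontal/vertical speed, so fast
    left-right motion yields a box wider than it is tall."""
    px, py = prev_pos
    vx, vy = velocity
    cx, cy = px + vx, py + vy          # expected new position
    w = base + gain * abs(vx)
    h = base + gain * abs(vy)
    return (cx - w // 2, cy - h // 2, w, h)

def inconsistent_point(velocities):
    """Return the index of a point moving against the mean motion of
    the other three (a crude stand-in for the paper's check that one
    point moves in a significantly different direction), else None."""
    for i, v in enumerate(velocities):
        others = [u for j, u in enumerate(velocities) if j != i]
        mx = sum(u[0] for u in others) / len(others)
        my = sum(u[1] for u in others) / len(others)
        if v[0] * mx + v[1] * my < 0:  # opposes the others' direction
            return i
    return None
```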


3.4 Conversion to Mouse Positional Co-ordinates

To control the position of a mouse pointer the centroid of the feature points is transformed into a 2-dimensional co-ordinate mouse position. A moving average filter, taking the current mouse position and averaging it with the n previous positions was used to produce a smooth mouse trace. With the system operating at 30 FPS, n=6 produced satisfactory performance. An overview of the acquisition and tracking procedure can be seen in Figure 3.

Figure 3: An overview of the acquisition and tracking procedure
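The moving average filter can be sketched as below. The class name is illustrative; n=6 matches the value reported above for 30 FPS operation.

```python
from collections import deque

class MouseSmoother:
    """Average the current mouse position with the n most recent
    positions to produce a smooth cursor trace."""

    def __init__(self, n=6):
        # Hold the current position plus the n previous ones.
        self.history = deque(maxlen=n + 1)

    def smooth(self, pos):
        self.history.append(pos)
        k = len(self.history)
        return (sum(p[0] for p in self.history) / k,
                sum(p[1] for p in self.history) / k)
```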


4. Augmentative Interaction

To allow augmentative interaction, systems must be capable of not only providing control of the mouse cursor movements but also full control of standard commercially available software applications. This includes provision for item selection tasks such as the mouse click, double click and drag/drop features. A click action can be defined as a single action performed by the user to generate a mouse click. The technique employed in this system to perform a click action involves opening the mouth. The mouth is a highly deformable feature: when closed it is much wider than it is high, and when open the ratio of width to height is reversed. The x co-ordinates of the left and right mouth edges are available from the two feature points being tracked on the mouth, and a similar algorithm was written to obtain the y co-ordinates of the top and bottom lip. By setting a threshold value, opening the mouth wider than the threshold generates a click action. This click action is completely non-intrusive, not physically demanding, instantaneously recognised and implemented (unlike the "Acceptance time/Dwell time" technique [1]), and lends itself easily to the development of drag/drop functions.
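The ratio-based click test can be sketched as follows; the threshold value and the use of a height-to-width ratio (rather than a raw height threshold) are assumptions chosen to illustrate the idea.

```python
def mouth_click(left, right, top, bottom, open_ratio=0.8):
    """Detect a click from four tracked mouth points, each (x, y).
    A closed mouth is much wider than high; when the height-to-width
    ratio exceeds the (per-user) threshold, a click is generated."""
    width = abs(right[0] - left[0])
    height = abs(bottom[1] - top[1])
    return width > 0 and height / width > open_ratio
```

In practice the threshold would be tuned per user, since mouth geometry and camera distance vary.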


5. Conclusion

A non-contact video based facial tracking system has been developed which automatically locates and tracks the user's facial features and transforms them into 2-D co-ordinates, which are then echoed to the user application screen as the mouse cursor. With this system, the user can move freely in the field of view of the camera, completely unhindered by any accessories or markers. Current research is concentrating on the development of on-line tremor suppression algorithms. A more complex neural network recognition system is also being developed, which would allow the user, with the aid of a therapist, to train the system to his/her specific needs.



[1] "Assistive Technologies: Principles and Practice", Cook A.M., Hussey S.M., Mosby -Year Book Inc., St. Louis, 1995.

[2] "Analysis of Intentional Head Gestures to Assist Computer Access by Physically Disabled People", Harwin W.S., Jackson R.D., Journal of Bio-Medical Engineering, Vol. 12, No.3, May 1990.

[3] HeadMaster System : Prentke Romich Co. Wooster, Ohio, USA

[4] HeadMouse, Origin Instruments, Greenview Drive, Grand Prairie, Texas 75050.

[5] Tracker, Madenta Communications. 20 Ave. Edmonton, Alberta 6N 1E5, Canada

[6] "Real-time Eye Feature Tracking from a Video Image Sequence using Kalman Filter", Xie X., Sudhakar R. Zhuang H., IEEE Trans. on Systems, Man and Cybernetics, Vol. 25, No. 12, 1995.

[7] "Headreader: Real-time Motion Detection of Human Head from Image Sequences", Mase K., Watanabe Y, Yasuhito Y., Systems and Computers in Japan, Vol. 23, No. 7, 1992.

[8] "Laser Mouse", J. Conlineau, J-Cl. Lehereau, D. Mazerolle, S. Formont, R.Reilly, J-Cl. Gabus, M. Butler, V. Sprécacénéré and B. Tenneson, in "The European Context for Assistive Technology", eds. I. Placencia Porrero, E. Puig de la Bellacasa. IOS Press, 1995.

[9] "Skin-Colour Modelling and Adaptation", J.Yang, W.Lu, A.Waibel, Proceedings of ACCV 1998, Vol. II, pp 687-694, Hong Kong.

[10] "Facial Feature Extraction from Colour Images", T.C. Chang, T.S. Huang, C. Novak, Proc. of the 12th IAPR Int. Conf. on Pattern Recognition, Vol. 2, pp 39-43, 1994.

[11] "A Model-Based Gaze Tracking System", R.Stiefelhagen, J.Yang, A.Waibel, International Journal of Artificial Intelligence Tools, Vol. 6, No. 2, pp 193-209, 1997.

[12] "Computer Vision Techniques for Man-Machine Interaction", J. Crowley. Proceedings of the Irish Machine Vision and Image Processing Conference (IMVIP-97), Vol. 1, pp 1-9, 1997.

[13] "Controlling a Computer via Facial Aspect", P.Ballard, G.C.Stockman, IEEE Trans. on Systems, Man and Cybernetics, Vol. 25, No. 4, 1995.

TIDE 98 Papers