
AN INTERFACE INTEGRATING EYE GAZE AND VOICE RECOGNITION FOR HANDS-FREE COMPUTER ACCESS

Franz Hatfield and Eric A. Jenkins

Synthetic Environments, Inc.
1401 Chain Bridge Road, Suite 300, McLean, VA 22101
Telephone: 703-748-0450, e-mail: hatfield@SynthEnv.com

Web Posted on: November 30, 1997


1. Introduction

This paper describes our ongoing work to integrate eye-tracking and voice recognition technology for hands-free operation of a graphical user interface. The goal of this research is to enable system users to gaze on user interface items of interest and issue verbal commands or queries that can be interpreted by the system, thus permitting hands-free operation. While speech and eye gaze by themselves are inherently ambiguous input streams, with properly designed human-machine interaction, user intent can be inferred by fusing the information in the separate voice and eye gaze data streams. This paper describes an initial implementation of an Eye/Voice Aware (EVA) user interface that allows consumers to access most of the functionality in a standard Microsoft Windows(TM) interface.

Interfaces to today's computer systems are rapidly becoming entirely graphical. Unlike their earlier command-line counterparts, they are generally less accessible to persons with disabilities because, to operate them, a user must be able to visually acquire and manually select objects with a fair degree of precision and then perform manual operations on them. With increasing interest in accessing information resources available through the Internet and World Wide Web (WWW), the need to find alternatives to mouse/keyboard access has become even more critical.

At present, Web browsers feature two-dimensional graphical user interfaces. With the recent adoption of the Virtual Reality Modeling Language by major browser vendors, the Web is increasingly becoming a 3-D environment. Efficient input devices and control programs suitable for navigating 3-D spaces simply do not exist today. Conventional mice and trackballs are difficult to use in 3-D, and user disorientation in virtual spaces is commonplace. The implication is that, if existing mouse/keyboard emulators for persons with disabilities are not currently up to the task of working in 2-D environments, they will be woefully inadequate in 3-D environments. Until new interface technology is developed, access to the 3-D Web will be difficult for everyone, but especially difficult for persons with disabilities. New navigation and control input technology will have to be developed, and this has motivated our current research to create new, alternative control technology for a variety of applications and consumers.

Moreover, accessibility to the Web and the huge library of commercial and other software applications will not be achieved by re-engineering these applications one-by-one to handle different input device technology, but will require "middleware" software to act as an intermediary between various input modalities and the target applications. The only practical solution is to produce adaptive device technology that can be configured like any computer peripheral and that does not depend on special allowances or "hooks" in the target computer program to meet the special needs of consumers.

With this as background, we have recently begun developing technology for individuals with specific and possibly multiple disabilities that combines eye-tracking and voice input to allow hands-free interaction with a standard window-based computer application. We call this system the Eye/Voice Aware interface, or EVA. The user of our system will be sighted, but it will not be necessary for that person to be able to hold his/her head stable, as some current assistive devices require. The user of a computer configured with EVA may have some speech impairment. EVA is designed to use existing commercial software as is, but replaces the traditional mouse and keyboard control with device technology that is appropriate for the user. EVA effectively collects the input signals from these alternative control devices and emulates the standard mouse and keyboard signals that the target software application expects.

While there are a number of assistive device products on the market that perform the same function as a mouse or keyboard (e.g., a head pointer or optical mouse used in combination with reduced keyboards), there are only a few products that integrate multiple input modalities (see, for example, [1]), and no products, to our knowledge, that integrate eye-tracking and voice recognition.


2. EyeTalk(TM) Implementation

The EVA system is built on top of general-purpose "middleware" software that we have been developing, called EyeTalk(TM). With EyeTalk(TM), the user issues verbal commands and requests while visually attending to the display. Inputs from one or more modalities (eye, voice, or keyboard) are timestamped according to their time of occurrence and passed to a fusion component that produces a single message of user intent. The fusion component effectively reconstructs the interaction event set as the user intended it by correlating events in time. For example, in using a web browser, the interaction consisting of uttering "click" while fixating a hyperlink sets off a variety of processing steps, many of which can be accomplished concurrently (a simplified sketch of this correlation follows the list):

  • (1) comparing the timestamp of the utterance as a whole (or more generally, the utterance of a specific word) to the eye point-of-gaze (POG) at the same time to find where the user was looking when the utterance was made;
  • (2) identifying the user interface control that is closest to the user's POG at the time the utterance is made; and
  • (3) generating feedback to the user in the form of a visual or auditory understanding response after the combined eye/voice message has been interpreted.
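A minimal Python sketch of this correlation step is shown below. The data structures, helper names, and the nearest-in-time/nearest-in-space matching rule are our own simplifying assumptions for illustration, not the EyeTalk(TM) fusion component itself.

```python
from dataclasses import dataclass

@dataclass
class GazeSample:
    t: float   # timestamp (seconds)
    x: float   # screen x of the point-of-gaze (POG)
    y: float   # screen y of the POG

@dataclass
class Control:
    name: str  # e.g. a hyperlink or button label
    x: float   # center x of the control's bounding box
    y: float   # center y

def fuse(gaze_buffer, controls, word, t_word):
    """Correlate a recognized word with the control the user was looking at
    when the word was spoken.  gaze_buffer and controls are assumed non-empty."""
    # (1) gaze sample closest in time to the utterance timestamp
    pog = min(gaze_buffer, key=lambda s: abs(s.t - t_word))
    # (2) interface control closest in space to that POG
    target = min(controls, key=lambda c: (c.x - pog.x) ** 2 + (c.y - pog.y) ** 2)
    # (3) a single message of user intent, handed on for interpretation and feedback
    return {"command": word, "target": target.name}
```

A call such as fuse(buffer, page_controls, "click", 12.48) would thus resolve the utterance to the hyperlink (or other control) the user was fixating at 12.48 seconds; the arguments here are, of course, hypothetical.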

The EyeTalk(TM) software currently runs on 32-bit Windows(TM) and some Unix platforms. A user "fixation" is considered to occur when the eye-tracker-computed POG remains within a relatively small spatial area for a minimum period of time. The computations that result in the displayed position are based on the eye-tracker-computed POG for the last one to two seconds, provided that a saccade (rapid eye movement involving a relatively large spatial displacement) has not occurred. If saccadic movement is detected by the algorithm, a new fixation period is initiated and the algorithm considers only the computed POG from the end of the saccade. The displayed cursor position represents a weighted average of all the POGs since the beginning of the fixation period.
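A minimal sketch of this cursor computation follows, assuming a buffer of recent (x, y) POG estimates. The saccade threshold and the particular weighting scheme shown are illustrative choices, not the values used in EyeTalk(TM).

```python
def cursor_from_fixation(pogs, saccade_threshold=40.0):
    """
    pogs: (x, y) POG estimates from the last one to two seconds, oldest first.
    Returns a cursor position computed as a weighted average of the samples
    since the most recent saccade-sized jump.
    """
    if not pogs:
        return None
    # Walk backwards to find the end of the most recent saccade, if any.
    start = 0
    for i in range(len(pogs) - 1, 0, -1):
        dx = pogs[i][0] - pogs[i - 1][0]
        dy = pogs[i][1] - pogs[i - 1][1]
        if (dx * dx + dy * dy) ** 0.5 > saccade_threshold:
            start = i            # a new fixation period begins here
            break
    fixation = pogs[start:]
    weights = list(range(1, len(fixation) + 1))   # later samples weighted more heavily
    total = float(sum(weights))
    cx = sum(w * p[0] for w, p in zip(weights, fixation)) / total
    cy = sum(w * p[1] for w, p in zip(weights, fixation)) / total
    return cx, cy
```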


3. Use of Multi-Modal Interaction in Computer Interfaces

Today's personal computers are based on the WIMP (windows-icons-menus-pointing) model; primary input is via a mouse or other pointing device and the keyboard. He and Kaufman [3] rate computer input devices in terms of the number of degrees of freedom they offer and the functionality (i.e., locating, choosing, commanding and valuating) that they provide. They rate eye-tracking as suitable (but not ideally suited) for spatially locating objects and entering choices, while voice input is ideally suited for command entry. Especially noteworthy is that between the two modalities (voice and eye-tracking), all major functions performed by input devices are covered.

Eye-trackers have been used in medical research and in high performance flight simulators for a number of years. Considerably less experience exists in integrating eye-tracking in the user interface in a more general way. Starker and Bolt [6] and Jacob [4] were among the first to investigate ways of using eye movement as an integral part of human-computer dialog. Eye gaze has been proposed for use in assistive devices to address several broad functional areas, including general communication and speech synthesis, and environmental control.

Voice recognition technology, while having advanced dramatically in the last decade, is still far from perfect. (Even human hearing is only 97% accurate on single word recognition tasks under ideal listening conditions.) Most systems do not perform well in high noise environments and, in the case of speaker independent systems, may fail to work well for individual speakers. Recognition accuracy may deteriorate in the presence of speech drift (which naturally occurs in all speakers); if a user has a moderate to severe speech impairment; or if a user breathes with the assistance of a ventilator. Speaker dependent systems (that must be trained) may overcome some of these problems.

Recently, multi-modal interaction has drawn increasing attention in the human-computer interaction research community at large. Koons et al. [5] describe a prototype system that combines speech, gesture and eye-tracking. In our own work [2], we have combined voice and eye gaze to control a simulated cockpit display. While eye-tracking is useful for pointing, it is not as effective for entering choices and commands. On the other hand, the spoken word is a medium capable of communicating a much richer set of user intentions than eye movement, but it is a poor device for pointing. The problem with using voice alone to refer (point) to objects is that each object must be uniquely identified, i.e., named, to resolve referential ambiguity.


4. Initial EVA Implementation

We implemented the EVA concept using our EyeTalk(TM) eye/voice aware software on a Windows(TM) 95 platform. Early in the project we established the goal of developing an approach that would work with any Windows(TM) application. The implication of this decision is that we need to support interaction with all the standard Windows(TM) controls, i.e., list boxes, combo boxes, buttons, edit boxes, etc. While we did not achieve this goal fully (for the reason given below), we were able to duplicate most of the standard Windows mouse and keyboard functionality using integrated eye/voice input. In EVA, we implemented a mode of operation in which the user can slave the screen cursor to his/her eye: wherever the user looks on the display while operating in this mode, the cursor will be displayed. We have noted, and consumers have corroborated, that it can be difficult to control the cursor in the current system implementation. For example, to push a button, the user fixates a Windows(TM) button and says "click" or "enter." Since control of a screen-displayed cursor is still more difficult than we would like, we added the capability to "snap" the cursor to the closest recognizable control, which tolerates imprecise positioning of the cursor and reduces the required fixation dwell times (a sketch of this snapping step follows).
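The snapping behavior can be sketched as follows; the rectangle representation, the snap radius, and the function name are illustrative assumptions rather than the EVA implementation itself.

```python
def snap_to_control(pog, controls, max_snap=80.0):
    """
    pog: (x, y) estimated point-of-gaze.
    controls: non-empty list of (name, left, top, right, bottom) bounding
    rectangles of the recognizable controls on screen.  Returns the name of
    the nearest control, or None when the gaze is farther than max_snap
    pixels from all of them.
    """
    def distance(ctrl):
        _, left, top, right, bottom = ctrl
        # Distance from the POG to the nearest edge of the rectangle
        # (zero when the POG already lies inside the control).
        dx = max(left - pog[0], 0.0, pog[0] - right)
        dy = max(top - pog[1], 0.0, pog[1] - bottom)
        return (dx * dx + dy * dy) ** 0.5

    nearest = min(controls, key=distance)
    return nearest[0] if distance(nearest) <= max_snap else None
```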

EVA is intended to be hardware independent, meaning it will work with any voice recognition system or any eye-tracking system. Thus far, we have used two different speech engines: the Speech Systems Inc. Phonetic Engine 500 and the IBM VoiceType. Both systems support speaker independent recognition and relatively large vocabularies (20,000 to 50,000 words). The VoiceType also provides support for a speaker dependent mode. We are currently using an Applied Science Laboratories SU 4000 head-mounted eye-tracking system, which is configured with a head tracking device so that the eye position with respect to the eye camera can be integrated with the head position to yield eye point-of-gaze with respect to the computer screen. As expected, consumers find wearing the head-mounted optics and a head-tracking sensor a bit uncomfortable. While we use a head-mounted system in our current work, we believe the future is in desktop-mounted eye-tracking.
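The geometry behind combining head pose and eye-in-head gaze direction reduces to intersecting a gaze ray with the screen plane. The sketch below illustrates that computation only; the coordinate frames, function name, and inputs are illustrative assumptions and do not model the ASL system's actual calibration or processing.

```python
def gaze_on_screen(eye_origin_head, gaze_dir_head, head_rot, head_pos,
                   screen_origin, screen_x_axis, screen_y_axis, screen_normal):
    """
    eye_origin_head, gaze_dir_head: eye position and gaze direction in the
    head-tracker frame.  head_rot (3x3, list of rows) and head_pos place the
    head in the world frame; the remaining arguments describe the screen
    plane in that same frame (axes assumed unit length and orthogonal).
    Returns (u, v) screen-plane coordinates of the point-of-gaze, or None if
    the gaze ray is parallel to the screen.
    """
    def mat_vec(m, v):
        return tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))
    def add(a, b): return tuple(a[i] + b[i] for i in range(3))
    def sub(a, b): return tuple(a[i] - b[i] for i in range(3))
    def dot(a, b): return sum(a[i] * b[i] for i in range(3))

    # Express the eye position and gaze ray in the world frame.
    origin = add(mat_vec(head_rot, eye_origin_head), head_pos)
    direction = mat_vec(head_rot, gaze_dir_head)

    # Intersect the gaze ray with the screen plane.
    denom = dot(direction, screen_normal)
    if abs(denom) < 1e-9:
        return None
    t = dot(sub(screen_origin, origin), screen_normal) / denom
    hit = add(origin, tuple(t * d for d in direction))

    # Convert the intersection to coordinates along the screen's axes.
    rel = sub(hit, screen_origin)
    return dot(rel, screen_x_axis), dot(rel, screen_y_axis)
```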

We tested the software with several Windows-based applications, including Microsoft's Internet Explorer (IE) Web browser and an application designed specifically to work with voice alone. With IE, users can navigate the various hyperlinks in a web page by "tabbing" between them with the command "forward tab." They can also fixate a particular hyperlink and utter "click," which has the effect of clicking with a mouse on the hyperlink. Users can also scroll the browser by issuing "page up" and "page down" commands, open list boxes, and select various items. Maintaining POG sufficiently long while the system first correctly interprets a spoken utterance and then correlates it with the interface control under the user's POG may be difficult for some users initially. Unfortunately, even when the user is fixating an object, his/her eyes are moving about rapidly, and it is rather unsettling to have the cursor reflect this variation back to the user. To improve cursor control, we implemented an adaptive eye gaze filtering algorithm that allows one to adjust a "recency" parameter, which gives more or less weight to the more recent point-of-gaze estimates as computed by the eye-tracking system. By increasing the parameter value, the system will be more sensitive to recent changes in computed POG; by lowering the parameter value, the system will respond more slowly to changes. Irrespective of the setting of the recency parameter, the system will detect rapid eye movement, e.g., saccades, and adjust very quickly to gross changes in POG. While we are continuing to improve this algorithm, we found that users could fairly easily learn to control the cursor with their eyes, even when the individual's POG was not accurately calibrated.
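One simple way to realize such a recency parameter is exponential smoothing with a saccade reset, sketched below. The class, parameter values, and exact form of the filter are illustrative assumptions, not the algorithm as implemented in the current system.

```python
class AdaptiveGazeFilter:
    """
    Exponential smoothing of POG estimates with an adjustable "recency"
    parameter: values near 1.0 track new samples closely, values near 0
    respond slowly.  A displacement larger than saccade_threshold resets the
    filter so that gross changes in POG are followed immediately.
    """
    def __init__(self, recency=0.3, saccade_threshold=40.0):
        self.recency = recency
        self.saccade_threshold = saccade_threshold
        self.smoothed = None

    def update(self, x, y):
        if self.smoothed is None:
            self.smoothed = (x, y)
            return self.smoothed
        sx, sy = self.smoothed
        jump = ((x - sx) ** 2 + (y - sy) ** 2) ** 0.5
        if jump > self.saccade_threshold:
            # Saccade detected: follow the new POG rather than smoothing it.
            self.smoothed = (x, y)
        else:
            a = self.recency
            self.smoothed = (a * x + (1 - a) * sx, a * y + (1 - a) * sy)
        return self.smoothed
```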

The software developed thus far will work with any Windows(TM) application, provided the software designers have used standard Windows(TM) controls. For example, if the application designers use the standard list box control provided in the Microsoft Foundation Classes (MFC), our software will be able to determine that this is a "legitimate" control and will therefore know what commands this control can respond to. On the other hand, if the application designers derived specialized classes from the standard control, or developed a composite interface object that blocks access to the constituent objects, EVA will not be able to identify the class of the control and will therefore not be able to determine what commands it can respond to. This has important policy implications for specifying software accessibility standards and points to one of the goals of this research: to encourage application developers to design user interfaces whose objects "advertise" their class membership (e.g., combo box, drop-down list, etc.). With this information, other software applications can provide alternative means of accomplishing the same functions that can now be accomplished with mouse and keyboard.
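As an illustration of the principle, the Python sketch below (using the Win32 API through ctypes, which postdates the original implementation and is not part of EVA) asks the operating system for the class name of the control under a given screen point. A control built from a standard class reports a recognizable name such as "Button" or "ListBox"; a specialized or composite control may not.

```python
import ctypes
from ctypes import wintypes

user32 = ctypes.windll.user32
user32.WindowFromPoint.restype = wintypes.HWND
user32.WindowFromPoint.argtypes = [wintypes.POINT]
user32.GetClassNameW.restype = ctypes.c_int
user32.GetClassNameW.argtypes = [wintypes.HWND, wintypes.LPWSTR, ctypes.c_int]

def control_class_at(x, y):
    """Return the window class name (e.g. 'Button', 'ListBox', 'ComboBox',
    'Edit') of the window under a screen point, or None if no window is
    found.  Windows-only sketch."""
    hwnd = user32.WindowFromPoint(wintypes.POINT(int(x), int(y)))
    if not hwnd:
        return None
    buf = ctypes.create_unicode_buffer(256)
    user32.GetClassNameW(hwnd, buf, 256)
    return buf.value
```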


5. Consumer Evaluation

To obtain feedback on system design early in the process, we have been working with consumers contacted through the Endependence Center of Northern Virginia (ECNV). ECNV is a consumer-run Center for Independent Living (CIL) designed and operated as a community-based organization by individuals with disabilities. The consumers who agreed to participate in the evaluation have a broad range of physical abilities, and hands-free control will enhance access to computing resources for all of them. The following are some of the types of disabilities and mobility limitations represented: spinal cord injury with high level quadriplegia or paraplegia; fibromyalgia with upper limb weakness and numbness and low stamina; multiple sclerosis with weakness and low stamina; cerebral palsy with upper limb and head spasticity and mild speech impairment; muscular dystrophy with upper extremity weakness and/or paralysis and low stamina; and juvenile arthritis with limited upper limb range of motion and strength. In evaluation sessions, consumers performed an information-gathering task involving navigation with a Web browser. Two individuals in the evaluation group had severe speech impairments, and it was not possible to reliably recognize their utterances while operating the speech recognition system in "speaker-independent" mode. We are currently investigating the use of speaker-dependent voice recognition to improve recognition accuracy for these individuals. Due to involuntary head movement, another individual had difficulty maintaining a stable head position while the eye-tracking system was being calibrated, which significantly reduced the accuracy of the eye POG estimate. It is important to note that with the present head-mounted system, the head need only be stabilized during calibration; involuntary (or voluntary) head movement during use does not present a problem. Another consumer (who also has a severe speech impairment) has limited control of her left eye, which is the one imaged by the eye-tracker. Either eye can be imaged, but doing so requires re-configuring the head band.

Overall, reports from this first implementation were very favorable. Participants were uniformly very enthusiastic about the technology. Six of the eight consumers who participated in the evaluation were able to operate the interface as well as or better than the developers with just a few minutes of practice. As expected, individuals noted that operating the current interface requires some effort in terms of maintaining eye point-of-gaze while issuing the appropriate commands. The evaluation sessions pointed out the need to incorporate speaker-dependent voice recognition in our system right away, as well as the need to develop a calibration procedure that does not require the user to keep his/her head stable during the short, 10-second process.


6. Summary

The assistive technology that is expected to arise out of this work will facilitate access for persons with disabilities to the extensive library of commercial and other software that already exists. Perhaps most important, we are gaining insight into the requirements and design specifications that can help define and drive the adoption of concrete computer accessibility standards that will make use of emerging multi-modal interface device technology. The initial EVA implementation demonstrates that standard Windows(TM) applications can be made eye/voice aware without fundamentally re-engineering the applications themselves. However, the inability to access the software class type of a standard user interface control in some cases presents a barrier to this approach. The solution, in our view, is to require software developers to advertise the class membership of their components so that the commands they respond to are accessible through any input modality. We do not believe this is an onerous requirement. Indeed, this is the direction being taken in recent object technology standardization initiatives (e.g., CORBA).

Acknowledgments:

The authors gratefully acknowledge the contributions of the Endependence Center of Northern Virginia, a Center for Independent Living, for their support in evaluating the EVA interface.


References

[1] Cook, Albert M., Hussey, Susan M., and Currier, Mary, "A Head-Controlled Dynamic Display AAC System," Proceedings of the RESNA '95 Conference, Vancouver, BC, June 9-14, 1995, pp. 106-108.

[2] Hatfield, F., Jenkins, E.A., Jennings, M.W., and Calhoun, G., "Principles and Guidelines for the Design of Eye/Voice Interaction Dialogs," Third Annual Symposium on Human Interaction with Complex Systems, IEEE, Dayton, OH, 1996.

[3] He, Taosong and Kaufman, Arie E., "Virtual Input Devices for 3D Systems," Proceedings of Visualization '93, IEEE Computer Society, San Jose, CA, October 1993, pp. 142-148.

[4] Jacob, Robert J.K., "The Use of Eye Movements in Human-Computer Interaction Techniques: What You Look At Is What You Get," ACM Transactions on Information Systems, Vol. 9(3), 1991, pp. 152-169.

[5] Koons, David B., Sparrell, C.J., and Thorisson, K.R., "Integrating Simultaneous Input from Speech, Gaze, and Hand Gestures," in M.T. Maybury (Ed.), Intelligent Multimedia Interfaces, AAAI/MIT Press, Menlo Park, CA, 1993.

[6] Starker, I. and Bolt, R.A., "A Gaze-Responsive Self-Disclosing Display," Proceedings CHI '90 Human Factors in Computing Systems, ACM Press, Seattle, WA, 1990, pp. 3-9.