
A PILOT STUDY FOR MULTIMODAL INPUT IN COMPUTER ACCESS

John Dunaway, Patrick Demasco, Denise Peischl, Alice Smith
Applied Science and Engineering Laboratories
University of Delaware / A.I. duPont Institute

Abstract

Multimodal input in user interfaces has been suggested as a means to obtain more efficient and robust interaction. One promising combination for individuals with good head and voice control is speech recognition and head-pointing. While each modality, combined with the proper technology, can be used to control keyboard and mouse input, integrating them has a potentially synergistic effect. This paper describes the overall research approach to studying the integration of speech and head-pointing, along with the results of a pilot study used to refine the experimental tools and protocol. The research is based on the premise that speech recognition will have a natural advantage for typing text and that head-pointing will have a natural advantage for mouse actions such as dragging. The first planned study will test this premise through a comparison of each modality on each class of task. However, given the significant differences between the devices, it is essential to understand the differences in their learning curves. The pilot study examined changes in performance over a large number of trials and supports (but does not prove) the conclusion that a study with approximately five trials will provide useful comparisons.

Background

Many people with motor impairments (e.g., spinal cord injury) face difficulties in efficiently accessing computers and other information systems. One approach to overcoming these "bandwidth" limitations is to combine more than one modality in a synergistic way [3]. For example, in recent years the Apple Macintosh AV series has been available with simple speech recognition for common commands and controls, which can in many cases save the user from switching between the keyboard and the mouse. Multimodal input has been investigated by a number of researchers in the AT community. Treviranus et al. [5] investigated the integration of speech with traditional single-switch scanning and showed an increase in selection rate despite the limited number of vocalizations available from the subjects. Cairns et al. [1] developed a computer access environment that integrated speech, gesture, eye-gaze and pointing. Kazi et al. [2] have investigated multimodal control (using speech and gesture) of assistive robots, combined with an artificial vision system and a reactive planner. In reviewing multimodal approaches, a number of potential benefits become apparent:

  • An increase in bandwidth can result in improved speed
  • An increase in redundancy may result in improved accuracy
  • Multiple methods for achieving tasks provide 1) greater choice for users; 2) greater flexibility in responding to changes in situation (e.g., environmental noise); and 3) additional ways to cope with fatigue
  • An increase in naturalness of interaction may result in greater satisfaction and ease of use.

Because the nature of multimodal user interfaces is combinatory, there is a rich set of opportunities for research, and each particular combination of devices brings a unique set of issues. The research discussed in this paper addresses the specific combination of speech recognition and head-pointing. This combination is appealing because, while products based on these modalities are both capable of mouse and keyboard emulation, it seems intuitive that each is better suited to one of the tasks. Both keyboard use and speech recognition-based typing are discrete tasks, while mouse use and head-pointing are continuous tasks. In addition, speech recognition systems are approaching the ability to accurately recognize large vocabularies without spelling.

Approach

The overall goal of this research is to explore issues arising from the combination of speech and head-pointing modalities as embodied in two specific products. As mentioned, each device is capable of providing complete access to mouse and keyboard functions. It is hypothesized that each device is better suited to one task: in particular, for most users speech recognition will be the preferred keyboard replacement and a head-pointing device will be the preferred mouse-emulation device. If this hypothesis is proven correct, then one can infer that integration will be beneficial. However, it will also be useful to study these technologies in an integrated form to explore possible benefits beyond using each device for its preferred task. For example, while users may prefer head-pointing to control cursor position, they might prefer speech recognition for button actions (e.g., double-click). The research is being conducted in three major phases: 1) both devices used separately for typing; 2) both devices used separately for mouse-based tasks; 3) combined devices used for a variety of tasks. Each experiment will be performed with approximately 15 subjects without disabilities and 5 subjects with disabilities. The overall research plan is discussed in more detail by Smith et al. [4], which also presents the results of a pilot study that examined different device parameters (especially learning configurations) in order to help define the overall protocol. The pilot study discussed in this paper primarily addresses the issue of learning curves.

Pilot Study Objective

For this research, a comparison is being made between two devices that are fundamentally different in their theory of operation, implementation and use. Because of this, there is a potential that actual use of the systems could have dramatically different learning curves. Since it is often impractical to evaluate use over a long period of time with a large number of subjects, it is necessary to limit the number of trials for each subject. If the system learning curves are in fact different, then the results can be misleading. This is illustrated by the hypothetical example shown in Figure 1. Treatment A has a much steeper learning curve than Treatment B. If a study were carried out for only 5 trials, then Treatment A would be deemed better despite the fact that in the long term (i.e., after trial 9) Treatment B is superior. One approach to this problem is to let all subjects reach asymptotic performance, but this can be impractical, and asymptotic performance can be difficult to define. In fact, the first pilot study of this research [4] showed a great deal of variability in speech recognition performance that would make it difficult to establish asymptotic performance. Hence the purpose of this second pilot study was to explore long-term use of both technologies with a single subject to better understand whether there are any significant learning curve differences. The results would help define the number of trials used in the larger multi-subject experiment. This experiment was also used to resolve any remaining issues in the protocol or with the tools being used.
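
To make the risk concrete, the following Python sketch simulates the hypothetical situation of Figure 1. The curve shapes and all parameter values are illustrative assumptions only, not data from this study; they simply show how a comparison truncated at five trials can favor the treatment that is inferior near asymptote.

    # Illustrative only: hypothetical learning curves for two treatments,
    # with made-up parameters chosen to mimic the Figure 1 scenario.
    # Treatment A learns quickly but plateaus low; Treatment B learns
    # slowly but plateaus higher, so a five-trial comparison ranks them
    # opposite to their long-term ordering.

    def rate_a(trial):
        """Hypothetical words per minute for Treatment A at a given trial."""
        return 12.0 * (1.0 - 0.6 ** trial)   # steep curve, asymptote ~12 wpm

    def rate_b(trial):
        """Hypothetical words per minute for Treatment B at a given trial."""
        return 16.0 * (1.0 - 0.85 ** trial)  # shallow curve, asymptote ~16 wpm

    def mean_rate(rate, n_trials):
        """Average rate over the first n_trials trials."""
        return sum(rate(t) for t in range(1, n_trials + 1)) / n_trials

    if __name__ == "__main__":
        print(f"mean over 5 trials: A={mean_rate(rate_a, 5):.2f}  B={mean_rate(rate_b, 5):.2f}")
        print(f"rate at trial 15  : A={rate_a(15):.2f}  B={rate_b(15):.2f}")
        # With these parameters, A wins the five-trial comparison even though
        # B overtakes it around trial 9 and is clearly better near asymptote.

With these made-up parameters, Treatment A averages roughly 8.7 words per minute over the first five trials versus about 5.9 for Treatment B, yet B reaches about 14.6 words per minute by trial 15 while A plateaus near 12.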

Method

This pilot study included one subject who had no prior experience with either input technology. The subject was a person without a disability who was familiar with computers. The subject was trained on the HeadMaster Plus with WiViK2 Visual Keyboard Version 2.1b, an on-screen keyboard with word prediction. For speech input, the subject was trained on DragonDictate 1.0 for Windows, a discrete speech recognition system based on sound matching. Additional information on the configuration of the input technologies can be found in Smith et al. [4].

In the pilot study, the subject typed a series of 160-word paragraphs, further grouped into sets of three, and was given the option to pause between paragraph sets. To ensure that the subject was ready to begin timing, a verbal cue was given for the DragonDictate system; for the HeadMaster Plus with WiViK, the subject had to select a specified key to begin timing. During training on both systems, the subject was given instructions on how to correct errors. These instructions were derived and refined from the initial pilot study.
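
For reference, the sketch below shows one straightforward way the typing rate for a single timed trial could be computed. The function, its timing placement, and its whitespace word count are illustrative assumptions, not a description of the evaluation software actually used in this study.

    import time

    def run_timed_trial(get_transcription):
        """Time one transcription trial and return the rate in words per minute.

        `get_transcription` is a stand-in for whatever blocks until the subject
        has finished transcribing the paragraph; the real evaluation software
        almost certainly differs.  The function is assumed to be invoked at the
        timing cue (a verbal cue for DragonDictate, selection of a start key
        for WiViK) and to return when the transcription is complete.
        """
        start = time.monotonic()
        transcribed = get_transcription()        # blocks until the paragraph is done
        elapsed_minutes = (time.monotonic() - start) / 60.0
        word_count = len(transcribed.split())    # simple whitespace word count
        return word_count / elapsed_minutes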

Results

The results of the pilot trials are shown in Figure 2.

Although the DragonDictate performance is more erratic, it appears to be consistently faster than WiViK. The average rate over seven trials was 7.99 words per minute for WiViK, compared to 11.41 words per minute for DragonDictate; over all twelve DragonDictate trials, the average rate was 12.53 words per minute. More important for this pilot study, however, is that the results indicate both input technologies provide valid results after the first trial. Any learning component appears to be negligible beyond the first trial, especially for DragonDictate.

Discussion

After analyzing the results in Figure 2, we were able to determine the number of trials to be used in the formal experiment. We decided on five trials, such that each subject will transcribe five data sets, in random order, with each input technology. The first trial will not be analyzed, in order to compensate for learning.

Interaction and feedback from the subject enabled us to make final modifications to the experimental design. These modifications were incorporated into this study and have been carried forward into the formal text-generation experiment. The WiViK dictionary was augmented from the Brown Corpus; the new dictionary contains 5000 words arranged by frequency, has improved WiViK performance, and has not created a bias toward the paragraph content.

As a result of this study, the procedures for making corrections with DragonDictate were simplified. The new procedures emphasize the use of the "scratch that" macro for deleting unwanted utterances and the "choose 10" command for removing the DragonDictate choice list as well as the current word. The procedures for providing instructions to the subject were also simplified: instructions for DragonDictate have been typed onto a template and secured to the computer monitor.

A significant modification to the transcription process also emerged from this study. It was decided to present the data sets line by line; previously, the entire transcription paragraph appeared in the text-editing window. The evaluation software was adapted to display a single line of text and advance the cursor two lines. The subject transcribes the top line and invokes a "page down" to erase the contents of the window and retrieve the next line.
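
A present-day sketch of one way to build a frequency-ordered word list like the augmented WiViK dictionary is shown below. The NLTK toolkit and the output file name are illustrative assumptions only, not the tooling used to produce the actual dictionary.

    # Present-day illustration only (not the original tooling): build a
    # 5000-word prediction dictionary ordered by Brown Corpus frequency.
    from collections import Counter

    import nltk
    from nltk.corpus import brown

    nltk.download("brown", quiet=True)

    # Count alphabetic word forms, case-folded.
    counts = Counter(w.lower() for w in brown.words() if w.isalpha())

    # Keep the 5000 most frequent words, most frequent first, one per line.
    top_words = [word for word, _ in counts.most_common(5000)]

    with open("wivik_dictionary.txt", "w") as f:   # hypothetical output file
        f.write("\n".join(top_words) + "\n")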

References

[1] A. Y. Cairns, W. D. Smart, and I. W. Ricketts. Alternative access to assistive devices. In Mary Binion, editor, Proceedings of the RESNA '94 Conference, pages 397-399, Arlington, VA, 1994. RESNA Press.

[2] Z. Kazi, M. Salganicoff, M. T. Beitler, S. Chen, D. Chester, and R. Foulds. Multimodal user supervised interface and intelligent control (MUSIIC) for assistive robots. In IJCAI-95 Workshop on Developing AI Applications for People with Disabilities, pages 47-58, Montreal, Quebec, 1995. IJCAI.

[3] F. Shein, N. Brownlow, J. Treviranus, and P. Parnes. Climbing out of the rut: The future of interface technology. In Beth Mineo, editor, Augmentative and Alternative Communication in the Next Decade, pages 36-39, Wilmington, DE, March 1990. Applied Science and Engineering Laboratories.

[4] A. Smith, J. Dunaway, P. Demasco, and D. Peischl. Multimodal input for computer access and alternative communication. To appear in Proceedings of Assets '96: The Third Annual ACM Conference on Assistive Technologies.

[5] J. Treviranus, F. Shein, S. Haataja, P. Parnes, and M. Milner. Speech recognition to enhance computer access for children and young adults who are functionally nonspeaking. In Jessica J. Presperin, editor, Proceedings of the Fourteenth Annual RESNA Conference, pages 308-310, Washington, DC, 1991. RESNA Press.

Acknowledgments

This work has been supported by a Rehabilitation Engineering Research Center Grant from the National Institute on Disability and Rehabilitation Research of the U.S. Department of Education (#H133E30010). Additional support has been provided by the Nemours Research Programs.

Author address

John Dunaway
Applied Science and Engineering Laboratories
A.I. duPont Institute
1600 Rockland Road, P.O. Box 269
Wilmington, Delaware 19899 USA
Internet: dunaway@asel.udel.edu