
Web Posted on: February 24, 1998


Portable Speech Accessor

Judy Jackson
Stanford University Center for the Study of Language and Information
Cordura Hall
Room 122
(650) 723-3598
jackson@csli.stanford.edu
judy@acm.org

Neil Scott
Stanford University
Center for the Study of Language and Information
Cordura Hall, Room 118
(650) 723-3774
ngscott@csli.stanford.edu

ABSTRACT

In this paper, we describe work that has produced a portable speech accessor capable of controlling multiple computers.

Many computer users are becoming disabled due to repetitive strain injuries such as carpal tunnel syndrome, tendonitis, and fibromyalgia. Speech input control of a computer is becoming widely available on the Windows 95 platform due to recent increases in chip speed and decreases in computer hardware prices. However, many people experiment with this technology and fail to make it work, not because of any problem with the technology itself, but because of unfamiliarity with the interactions between the hardware and software, or because of poorly designed computer-human interfaces. This technology is often not available on the machines on which disabled computer users already have expertise, or is not available in a uniform way across several different machines. Simpler interfaces for complete control of all computers are necessary.

The speech accessor described in this paper is a part of the Total Access System, developed by Project Archimedes at Stanford University. It is a fully functional solution for the problems mentioned above, and is used on a daily basis by one of the authors to control her machines. Judy Jackson became interested in speech interfaces to computers after becoming disabled 3 years ago due to severe bilateral tendonitis. A computer programmer by trade, she has done the programming necessary to port the software portion of the Total Access System from DOS to Windows 95 using this system.

BACKGROUND: TOTAL ACCESS SYSTEM OVERVIEW

There are three basic components in the Total Access System:

  1. an "accessor" which provides the human/computer interface;
  2. a "Total Access Port" (TAP) that provides a standardized input/output port for any target computer; and (3) a Total Access Link that connects any accessor to any TAP.

Accessors translate between the specific needs, abilities, and preferences of each disabled user and a standardized user interface protocol. TAPs translate between the standardized user interface protocol and the particular hardware and software interface of the computer to which it is attached. This approach has the potential to enable any disabled individual to work with any computer-based device.

Accessors adhere to the underlying concept of the Total Access System that all user input and output functions are performed outside of the computer that is running the target application. This differs from the traditional approach of building software solutions for installation on the user's machine, and it is done for reasons of portability and streamlined access. Porting accessibility modifications from one system to another is difficult and time-consuming because these modifications are typically installed locally on a single machine. Users who require portability between several machines must install multiple versions of these accessibility modifications. In the case of speech software, users must install, configure, train, and maintain multiple copies of their voice files. Users who require access to different kinds of machines, such as Macintoshes and PCs, must learn several kinds of accessibility packages, each with different features. The Total Access System solves these problems by providing one accessor that becomes the user's interface into whatever machine he/she wishes to use.

Researchers at Project Archimedes have developed accessors using speech recognition, head tracking and eye tracking, and TAPs for PC, Mac, Sun, SGI and HP workstations. (See the paper entitled "The Total Access System", by Neil Scott, for a comprehensive overview of the input functions of the Total Access System.)

THE SPEECH ACCESSOR

This portion of the paper describes the speech accessor in its current form, current work and future directions.

A speech accessor in use today by one of the authors consists of the following hardware and software:

  • Hardware: at minimum, a Pentium 133 laptop with 48 MB of RAM, a 1 GB hard drive, and the standard SoundBlaster-compatible on-board sound chip supplied with the machine.
  • Software: Windows 95, DragonDictate(TM) Versions 2.5.2 and 3.0.1, and the Bridge software developed at Project Archimedes.
  • A TAP for each machine to be voice controlled.

The speech accessor software in its current form runs on Windows 95. Two separate programs run on the accessor: a speech engine and a communications program called the Bridge. The Bridge sends commands issued on the accessor to the TAP, which emulates the keystroke commands possible on the target system. DragonDictate(TM) was chosen as the speech engine because of its extensive macro language and its programming interface. The syntax of the macro language is similar to that of Visual Basic, and so is relatively simple, yet powerful. The Bridge software runs as a separate program from the speech engine in order to keep the interface into the speech engine general. This design ensures that any method of input that can send keystrokes into a window on the accessor, for instance another speech recognition engine or someone typing on the keyboard, will have the ability to control the target machine.
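
The Bridge is essentially a forwarder: it accepts keystrokes delivered into a window on the accessor and relays them over the Total Access Link to the TAP. The following C++ sketch illustrates only that idea; the actual Bridge protocol, the port name "COM1", and the one-byte-per-keystroke framing are assumptions made for this example, not the documented behavior of the real software.

    // bridge_sketch.cpp: illustrative only, not the actual Bridge source.
    // Assumes the Total Access Link appears to the accessor as a serial
    // port ("COM1") and that the TAP accepts raw keystroke bytes.
    #include <cstdio>

    int main() {
        // Open the link to the TAP. On Windows 95 a serial port can be
        // opened like an ordinary file.
        std::FILE* link = std::fopen("COM1", "wb");
        if (!link) {
            std::fprintf(stderr, "Cannot open link to TAP\n");
            return 1;
        }

        // Forward every keystroke arriving in the accessor's input window
        // (modeled here as standard input) to the TAP, which replays it as
        // a keystroke on the target machine.
        int ch;
        while ((ch = std::getchar()) != EOF) {
            std::fputc(ch, link);
            std::fflush(link);   // send each keystroke immediately
        }

        std::fclose(link);
        return 0;
    }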

The speech environment consists of commands to control the functioning of the Bridge software, and commands to control the target machine(s). In a case where there are several target machines, a switchbox controlled by voice can be added. Voice control of the switchbox is accomplished through a parallel cable run from the accessor to the switchbox. The user utters a voice macro, such as "talk to the mac", to send a signal to the switchbox, which then switches to controlling the keyboard and mouse of an attached Macintosh.
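
As a rough illustration of how such a macro might drive the switchbox, the hypothetical C++ fragment below maps a spoken target name to a channel number and writes a single select byte out of the accessor's parallel port (opened here as "LPT1"). The channel numbering and the one-byte protocol are assumptions made for the example; the actual signaling used by the switchbox is not specified in this paper.

    // switchbox_sketch.cpp: hypothetical illustration of selecting a target
    // machine on a voice-controlled switchbox.
    #include <cstdio>
    #include <cstring>

    // Map the spoken target name to an assumed switchbox channel.
    static int channel_for(const char* target) {
        if (std::strcmp(target, "mac") == 0) return 1;
        if (std::strcmp(target, "pc")  == 0) return 2;
        if (std::strcmp(target, "sun") == 0) return 3;
        if (std::strcmp(target, "sgi") == 0) return 4;
        return -1;
    }

    // Called when the user says, for example, "talk to the mac".
    bool talk_to(const char* target) {
        int channel = channel_for(target);
        if (channel < 0) return false;

        // On Windows 95 the parallel port can be opened as "LPT1".
        std::FILE* port = std::fopen("LPT1", "wb");
        if (!port) return false;

        std::fputc(channel, port);   // one select byte tells the box which
        std::fclose(port);           // keyboard and mouse lines to drive
        return true;
    }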

THE DESIGN OF THE SPEECH ENVIRONMENT

The design of the speech environment has been governed by several principles.

Commands must be restricted to small sets at each point in time, to reduce misrecognitions. Command misrecognition that can be simply annoying for a dictation system can be disastrous for a hands-free command and control system because voice commands can potentially start unwanted programs or exit desired programs. DragonDictate(TM) provides vocabularies and subvocabularies, called groups, of commands that can be activated on a per-program basis. Users can activate specific groups of commands by voice, like so:

  • [enable C++ commands]
  • [disable C++ commands]
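
The same principle of keeping only a small command set active at any moment can be sketched in ordinary code. The hypothetical C++ class below is not DragonDictate's macro language or API; it simply illustrates enabling and disabling named command groups so that the recognizer only has to match the currently active phrases.

    // command_groups_sketch.cpp: illustrative only; shows the idea of
    // per-context command groups, not DragonDictate's actual interface.
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    class CommandGroups {
        std::map<std::string, std::vector<std::string>> groups_;  // group name -> commands
        std::set<std::string> active_;                            // currently enabled groups
    public:
        void define(const std::string& group,
                    const std::vector<std::string>& commands) {
            groups_[group] = commands;
        }
        void enable(const std::string& group)  { active_.insert(group); }   // [enable C++ commands]
        void disable(const std::string& group) { active_.erase(group); }    // [disable C++ commands]

        // Only these phrases need to be matched by the recognizer, which
        // reduces the chance of a misrecognized command.
        std::vector<std::string> activeCommands() const {
            std::vector<std::string> result;
            for (const auto& name : active_) {
                auto it = groups_.find(name);
                if (it == groups_.end()) continue;
                result.insert(result.end(), it->second.begin(), it->second.end());
            }
            return result;
        }
    };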

Likewise, commands to control particular applications can be activated and deactivated with simple voice commands. A general set of commands likely to be used at any point in time, regardless of which machine one is controlling, has been created. This includes commands such as these:

  • [talk to the mac]
  • [talk to the pc]
  • [talk to the sun]
  • [talk to the sgi]
  • [leave bridge]
  • [bring up bridge]
  • [my name]
  • [my email address]
  • [my phone number]

Commands that are appropriate for the Macintosh become available after the [talk to the mac] command has been uttered. Examples of such commands are:

  • [apple menu]
  • [file menu]
  • [edit menu]
  • [quit app]

In addition to these basic voice commands for controlling general operating system functions, full mouse control, including clicking and mouse motion, is provided with commands similar to the following:

  • [press button]
  • [release button]
  • [click left]
  • [click right]
  • [double click]
  • [run left]
  • [run right]

The system has similar voice commands for each of the supported target machines (Macintosh, PC, SGI, Sun and HP workstations), thereby providing dictation and complete control, including mouse control, of 5 different types of machines. With intelligent choices of voice commands, it is possible to create an environment in which the user can work faster than someone using a computer without voice. Current work is in progress to extend the capabilities of the speech environment in this manner. Part of that work involves integrating continuous speech with discrete speech.

Human-to-computer speech interactions can be characterized as having both discrete and continuous phases. The user speaks discretely when issuing a command to the system, such as "Wake Up", "Get My Mail", or "hello". This phase consists of many small utterances separated by silence. Users controlling a computer typically utter discrete phrases. Users dictating usually speak in increasingly longer phrases, or continuously.

Continuous speech recognition engines have recently become widely and relatively cheaply available. However, these products are not currently suitable for complete command and control of a computer. Some lack hands-free correction abilities. Those that provide hands-free correction are still difficult to use, in that they allow the user to get into a state where only certain obscure commands are recognized. Another method of interaction is needed in order to provide a completely hands-free environment.

Integration with a continuous speech engine would remove the need to learn to speak in a rhythm of discrete utterances during the dictation phases of user interaction. The first widely used speech products were discrete. Recently, continuous speech engines have become available and work directly with the Total Access System without modification. However, the correction features of these new speech engines are not user-friendly. Work is in progress to make a hands-free, user-friendly environment that combines both efficient command and control of a machine as well as continuous dictation.

FUTURE DIRECTIONS

Using an even smaller machine, such as a wearable computer, can enhance the portability of the current speech accessor. Wearable computers exist today, and are in use by technicians such as aircraft mechanics and factory workers who need hands-free access to the information a computer can provide while doing their jobs. Many of these computers are not suitable for use within the disabled community. Their visual interfaces are small, and often do not have high resolution. Because they are made for the general population of factory workers, their fasteners are made to be rugged but are also difficult to open and close. The speech environment that can run on these machines is neither entirely hands-free nor large-vocabulary. A light, portable computer of the type available today as a palmtop could function as a speech accessor. It would require a faster processor than the current subnotebooks, as well as more memory and the capability for speech input and output. Project Archimedes has begun testing wearable computers, microlaptops, and palmtops, and recommending modifications to vendors for their next-generation computers.

A C/C++ programming environment has been configured, with voice commands for individual coding constructs created by hand as a proof of concept. It has been shown that building a program, for instance the typical "Hello, World!" program, which simply prints the text "Hello, World!" to the screen, can be done faster by voice than it can be typed. The work of creating these voice commands needs to be automated. A possible solution would be a program that takes the description of a programming language, for instance C++, and creates voice commands for the individual constructs in the language. This could be built into a compiler for the language, or be a separate program.
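
For reference, the program mentioned above is shown here; each construct (the include line, the main() skeleton, and the output statement) can be bound to its own voice command.

    // The "Hello, World!" program referred to above. Each construct (the
    // include, the main() skeleton, and the output statement) can be
    // produced by a separate voice command.
    #include <iostream>

    int main() {
        std::cout << "Hello, World!" << std::endl;
        return 0;
    }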

An intelligent macro monitor that creates commands for frequent tasks would be a good addition. This "monitor" would filter the keystrokes made by the user, determine patterns that are likely to recur, recommend voice commands to be made, and then create and store them for the user. It could be an autonomous agent, running separately on the computer, and could have applications for the general population as well as for the disabled community.
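
A minimal sketch of the monitoring idea follows, assuming the monitor watches fixed-length keystroke sequences and counts how often each recurs; the sequence length and the suggestion threshold are arbitrary values chosen for illustration, not parameters of any existing implementation.

    // macro_monitor_sketch.cpp: hypothetical illustration of an intelligent
    // macro monitor that watches the keystroke stream, counts recurring
    // fixed-length sequences, and flags those frequent enough to be worth
    // turning into voice commands.
    #include <deque>
    #include <iostream>
    #include <map>
    #include <string>

    constexpr std::size_t kSeqLen = 8;   // length of sequences to track (assumed)
    constexpr int kThreshold = 5;        // repetitions before suggesting (assumed)

    class MacroMonitor {
        std::deque<char> window_;            // the most recent keystrokes
        std::map<std::string, int> counts_;  // how often each sequence occurred
    public:
        // Feed every keystroke the user types through this method.
        void observe(char key) {
            window_.push_back(key);
            if (window_.size() < kSeqLen) return;
            if (window_.size() > kSeqLen) window_.pop_front();

            std::string seq(window_.begin(), window_.end());
            if (++counts_[seq] == kThreshold) {
                // A candidate for a new voice macro has been found.
                std::cout << "Suggest a voice command for: \"" << seq << "\"\n";
            }
        }
    };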

ACKNOWLEDGMENTS

The authors wish to thank Hewlett Packard, Boeing Corporation and Stanford University for participating in a study to determine the effectiveness of the speech accessor in enabling people suffering from musculoskeletal disorders to return to their full capacities; and Toshiba for donating a Libretto palmtop for our evaluation as an accessor.


REFERENCES

[1] Scott, Neil. Using the Total Access System to Access the World Wide Web. Paper and Presentation to the World Wide Web Sixth International Conference (April 1997).

[2] Scott, Neil. Universal Speech Access, in Proceedings of Speech Tech/Voice Systems Worldwide, 1992.

[3] Scott, Neil. The Universal Access System, Presentation at the American Voice Input/Output Society Conference, Atlanta, September 1991.

[4] Scott, Neil, and Jackson, Judy. Demonstration of the TAS System, a finalist in the Hardware category of the Discover Magazine Discovery Awards, Epcot Center, Disney, Orlando, Florida, May-June 1997.