MULTILINGUAL TEXT-TO-SPEECH SYSTEM
WITH NEW AUDITORY USER INTERFACE
FOR MICROSOFT(R) WINDOWS(R)

Takayuki Watanabe and Tuneyoshi Kamae
Department of Physics, School of Science, University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, JAPAN
watanabe@phys.s.u-tokyo.ac.jp
kamae@phys.s.u-tokyo.ac.jp

Tooru Kurihara
Department of Information Sciences (for the blind), Tsukuba College of Technology
GHG02035@niftyserve.or.jp

We are developing new multilingual speech systems (Voice-IME, Voice-Meadow, and Voice-WSH) that enable visually impaired persons to use computers running Microsoft(R) Windows(R). These systems use elaborate Auditory User Interfaces that render the contextual information of an application's content into a 3-dimensional Text-to-Speech audio space with various kinds of auditory formatting, improving the human-computer interface through the sense of hearing. Beyond this purpose, the current system can be used in other fields, such as mobile computing, where the sense of hearing plays an important role. Development is still in progress, but the essential functions have already been implemented.

Introduction

Computers became friendlier with the advent of the Graphical User Interface (GUI); visually impaired users, however, cannot benefit from a GUI. Traditional screen-readers are ineffective for complicated GUIs because they ignore the contextual structure of the visual output and simply speak the displayed characters as plain text.

In 1994, Dr. Raman built a new system, ASTER (Audio System for TEchnical Readings) (ref. Raman1994). ASTER is a computing system that renders the original content of an application into auditory space with various audio-formatting styles. Raman extended this approach to more general computing tasks and released a new speech interface, Emacspeak (ref. Emacspeak and Raman1997), as an Emacs subsystem. In Emacspeak, the user interfaces are separated from the contents, and they arrange the contents so that the information is presented effectively to the sense of hearing. In other words, Emacspeak acts as an effective Auditory User Interface (AUI). These systems, however, cannot handle Japanese and run only on Emacs.

Thus, Japanese visually impaired users strongly need a new auditory rendering system that is based on Raman's work but can handle Japanese applications on Microsoft(R) Windows(R).

Development of a new system

1. Voice-Meadow

We are developing a new multilingual Text-to-Speech (TTS) system with a 3-D AUI, Voice-Meadow. Voice-Meadow is a natural extension of Emacspeak to Microsoft Windows. Meadow (ref. Meadow) is a multilingual Emacs for 32-bit Windows and is based on Emacs 20.

Voice-Meadow does not use speech recognition as an input method, because a computer expert can use a keyboard regardless of visual ability, and a keyboard is the most efficient input method for an expert.

The way Voice-Meadow renders the contextual information of the contents into auditory output follows that of Emacspeak. For example, Voice-Meadow offers an effective AUI for browsing and searching a calendar. It can also place the outputs of Meadow's three frames (windows) at three different positions in auditory space, which lets the user switch from one frame to another immediately and handle three different tasks. As another example, the text being edited can be assigned to a male voice, with a bold voice for bold fonts, while other information such as the current line number can be assigned to a female voice.
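
The voice and position assignments described above can be pictured as a small lookup from the kind of content to a speaking style. The following Python sketch is only an illustration under stated assumptions: the names VoiceStyle, FRAME_AZIMUTHS, and style_for, and the azimuth angles, are hypothetical and are not taken from Voice-Meadow or Emacspeak.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceStyle:
    voice: str          # "male" or "female"
    bold: bool          # True when the "bold voice" variant should be used
    azimuth_deg: float  # horizontal position in auditory space (negative = left)


# Assumed placement of Meadow's three frames: left, center, right.
FRAME_AZIMUTHS = {0: -60.0, 1: 0.0, 2: 60.0}


def style_for(kind: str, bold: bool = False, frame: int = 0) -> VoiceStyle:
    """Pick a speaking style for a piece of output.

    kind  -- "text" for the text being edited, "status" for auxiliary
             information such as the current line number
    bold  -- whether the text is displayed in a bold font
    frame -- index of the frame (window) that produced the output
    """
    if kind not in ("text", "status"):
        raise ValueError(f"unknown content kind: {kind!r}")
    voice = "male" if kind == "text" else "female"
    # Only edited text carries font information; status output is never bold.
    return VoiceStyle(voice, bold and kind == "text", FRAME_AZIMUTHS[frame])
```

With such a table, bold text in the first frame would be spoken by the bold male voice on the listener's left, while a line number reported from the third frame would be spoken by the female voice on the right.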

Voice-Meadow additionally handles multilingual input and multilingual audio output. Since Japanese input is handled by the IME independently of the application, auditory rendering of the IME is handled by a dedicated separate application, Voice-IME. Audio formatting is used to notify the user of the IME status, i.e. whether Japanese characters are being input or not.

The TTS server, the speech server of the current system, uses DirectSound(R) as its multimedia device and can handle simultaneous inputs and 3-D output. It uses Microsoft's TTS engine (mode) when speaking English and Toshiba's engine (mode) when speaking Japanese. The TTS functions are implemented with the Microsoft Speech SDK.
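
To make the per-language engine choice concrete, the following Python sketch splits mixed text into runs and labels each run with the engine that would speak it. It is an assumption-laden illustration, not the method used by the TTS server: the function names, the "en"/"ja" labels, and the rule of detecting Japanese by Unicode range are all hypothetical, and the Speech SDK itself is not called.

```python
def is_japanese(ch: str) -> bool:
    """Rough test for Japanese characters by Unicode range (an assumption)."""
    code = ord(ch)
    return (
        0x3040 <= code <= 0x30FF     # hiragana and katakana
        or 0x4E00 <= code <= 0x9FFF  # CJK unified ideographs (kanji)
        or 0xFF66 <= code <= 0xFF9F  # half-width katakana
    )


def split_by_engine(text: str) -> list:
    """Split text into (engine, chunk) runs, with engine "ja" or "en".

    Spaces and punctuation are neutral: they stay with the current run,
    or join the following run when they appear at the very start.
    """
    runs = []  # each item is [engine or None, chunk]
    for ch in text:
        if is_japanese(ch):
            engine = "ja"
        elif ch.isalnum():
            engine = "en"
        else:
            engine = runs[-1][0] if runs else None  # neutral character
        if runs and runs[-1][0] in (engine, None):
            runs[-1][0] = engine  # resolves a leading neutral run, if any
            runs[-1][1] += ch
        else:
            runs.append([engine, ch])
    # Text with no letters at all falls back to the English engine.
    return [(engine or "en", chunk) for engine, chunk in runs]
```

For example, a line that mixes English words with kanji would be handed to the two engines as alternating chunks, in the original order.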

The TTS client, an interface between Meadow and the TTS server, can also be used outside Meadow at the DOS prompt. When the standard output of a "dir" or "type" command is piped into the standard input of the TTS client, the TTS server speaks the directory information or the contents of the file.
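
The pipe usage described above amounts to reading lines from standard input and forwarding each one to the speech server. The following Python sketch shows that loop only; how the real TTS client talks to the TTS server is not described here, so the transport is left as a caller-supplied function, and the names forward_lines and main are hypothetical.

```python
import sys


def forward_lines(stream, send) -> int:
    """Forward each non-blank line of `stream` to `send`.

    stream -- any iterable of text lines, such as sys.stdin
    send   -- callable that delivers one line of text to the speech
              server (the actual transport is an assumption left open)
    Returns the number of lines forwarded.
    """
    count = 0
    for line in stream:
        text = line.rstrip("\r\n")  # drop DOS or Unix line endings
        if text.strip():            # skip lines with nothing to speak
            send(text)
            count += 1
    return count


def main(send=print) -> int:
    """Entry point for piped use; `print` stands in for the real server."""
    return forward_lines(sys.stdin, send)
```

If a script ended by calling main(), a command of the form "dir | python tts_client.py" (the script name is hypothetical) would pass the directory listing through this loop line by line.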

2. Voice-WSH

Voice-Meadow is intended for experts because Meadow is not an easy application. There are, however, many visually impaired persons who are not experienced with computers. Thus, a new system that naturally extends Voice-Meadow to general Windows applications is required.

Voice-WSH is designed to meet this demand. As Emacspeak clearly shows, an intelligent AUI is indispensable to such a system. We use Microsoft Visual Basic(R) for Applications (VBA) and Microsoft Visual Basic(R) Scripting Edition (VBScript) with Windows Scripting Host (WSH) as the rendering languages of the AUI. VBA is not supported by all applications, but it is supported by major ones such as Microsoft Office. Since VBA and VBScript are subsets of Visual Basic, they are easy to use, and users can customize the AUI for their own purposes.

3. Application to other fields

Voice-Meadow can be used on Unix platforms with an appropriate speech device. The current TTS system can be applied to other welfare uses, such as helping people who have difficulty reading. It can also be applied to mobile computing, such as medical rounds by doctors, wearable computers, and in-car use (e.g. navigation systems). Applying this system to these markets will involve a speech recognition system or a dedicated speech device.

Concluding remarks

The current suite of Voice systems (Voice-IME, Voice-Meadow, and Voice-WSH) is still under development, but the essential parts have been implemented. The current system, based on Emacspeak, is a TTS system with a much wider scope than currently available commercial screen-readers such as ProTALKER(TM)97 and 95Reader. The heart of the current system is a context-sensitive interactive AUI, which, apart from the current system, has been implemented only by Emacspeak.

Aural output is like an auto-scrolling display because it is temporal. At the same time, its eyes-free interaction is suitable for mobile computing. Thus, a computer that incorporates independent aural and visual interfaces will open up a new world.

References

(Raman1994): T.V. Raman: "Audio System for Technical Readings", Ph.D. thesis, Cornell University (1994).
(Emacspeak): The Emacspeak package is available at http://simon.cs.cornell.edu/Info/People/raman/emacspeak/.
(Raman1997): T.V. Raman: "Auditory User Interfaces -Toward the Speaking Computer-", Kluwer Academic Publishers (1997).
(Meadow): Meadow, Multilingual enhancement to gnu Emacs with ADvantages Over Windows, was written by Miyashita Hisashi and edited by Takeyori Hara. You can get Meadow at http://mechatro2.ME.Berkeley.EDU/~takeyori/meadow/.
