
Web Posted on: March 3, 1998


THE DEVELOPMENT OF A VIDEO TAP TO PROCESS THE VISUAL DISPLAY INTO NON-VISUAL MODALITIES

Janice L. McKinley
Western Blind Rehabilitation Center
VA Medical Center
3801 Miranda Ave.
Palo Alto, CA 94303
Voice/Message: (650)493-5000 ext. 64379
Internet: mckinley@csli.stanford.edu

Neil Scott
Archimedes Project
Center for the Study of Language and Information
Stanford University
Stanford, CA 94305-4115
Voice/Message: (650) 725-3774
Internet: ngscott@arch.stanford.edu

The VideoTAP is a component of the Total Access System (TAS) that will retrieve information from the screen display of a computer. It is being designed to work with another Total Access System component, the GUI accessor, to enable a blind person to operate a computer that is otherwise inaccessible. The focus of this project is to design a prototype system that will work with generic Graphical User Interfaces (GUIs). The VideoTAP and GUI accessor adhere to the underlying concept of the Total Access System: all user input and output functions are performed outside of the computer that is running the target application.

The rationale behind the Total Access System approach is that current computers do not provide adequate, stable resources to support long-term, universally accessible solutions. Input and output (I/O) are handled differently on every platform; hardware support for I/O is already at maximum usage on platforms such as the PC; operating systems are evolving faster than access software can be designed; and many application software designers implement programs that break access tools.

We see the VideoTAP complementing existing screen readers that work successfully in a specific GUI environment such as Microsoft Windows 95. An important aspect of our philosophy is to include the option of using existing or yet-to-be-developed access software and hardware with the target computer and the GUI accessor. The VideoTAP would be a resource when other methods of access are incompatible or unavailable. We are concerned not only with access to GUIs found on computers, but also with those being implemented in appliances, information kiosks, and television/internet interfaces.

The Total Access System provides plug-and-play capabilities for adding and removing user input and output devices to computers. It cleanly separates the human/computer interface requirements of a user from the interfacing conventions of any computer he or she may wish to use. There are three basic components in the Total Access System: (1) an "accessor" that provides the human/computer interface; (2) a "Total Access Port" (TAP) that provides a standardized input/output port for any target computer; and (3) a "Total Access Link" that connects any accessor to any TAP. Accessors translate between the specific needs, abilities, and preferences of each disabled user and a standardized user interface protocol. TAPs translate between the standardized user interface protocol and the particular hardware and software interface of the computer to which each TAP is attached. This approach has the potential to enable any disabled individual to work with any computer-based device.
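The three-part division described above can be sketched in Python. This is only an illustration under stated assumptions: the class names, the `UiEvent` record, and the key-code table are hypothetical and do not represent the actual Total Access System protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UiEvent:
    """Hypothetical standardized user-interface message carried by the link."""
    kind: str   # e.g. "key"
    value: str  # e.g. "A"


class SpeechAccessor:
    """Accessor: translates one user's preferred input into standard events."""

    def translate(self, spoken_word: str) -> UiEvent:
        # Assumption: each recognized word maps to a single key press.
        return UiEvent(kind="key", value=spoken_word.upper())


class PcTap:
    """TAP: translates standard events into one platform's native input."""

    # Illustrative key-code table; the values are not taken from the source.
    KEY_CODES = {"A": 0x1E, "B": 0x30}

    def deliver(self, event: UiEvent) -> int:
        return self.KEY_CODES[event.value]


def total_access_link(accessor, tap, user_input: str) -> int:
    """Link: carries the standardized event from any accessor to any TAP."""
    return tap.deliver(accessor.translate(user_input))
```

Because the accessor and the TAP meet only at `UiEvent`, either side could in principle be swapped without changing the other, which is the property the paragraph above describes.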

The Archimedes Project has developed accessors using speech recognition, head tracking and eye tracking, and TAPs for PC, Mac, Sun, SGI and HP workstations. (See the paper entitled "The Total Access System" for a comprehensive overview of the input functions of the Total Access System.)

The proliferation of GUIs on many computer platforms has forced blind computer users to work in these environments. In order to access information outside of the operating system, the VideoTAP must be capable of providing meaningful information from GUIs.

Part of this project is focused on processing and presenting information so that it is intuitively meaningful for the user. The user's needs and preferences are an integral part of the appropriate computer interface along with available technology. We are developing a flexible system that can be designed for an individual user, while at the same time allowing access to a number of different computer and operating systems.

We believe that a multi-modal approach will ultimately be the best way of converting information from a GUI into a non-visual modality. We are processing information at its most basic level into meaningful chunks that can then be presented to the user in various modalities. For example, a blind computer user may need to access a GUI display while editing audio information and prefer to minimize additional sources of sound in the computer. One of our investigators has designed a haptic display which enables this kind of feedback.

The Total Access System approach simplifies the complex problem of accessing many differing operating systems by dividing the problem into several components. The approach we have adopted ignores the operating system completely by retrieving all of the necessary information from the image displayed on the computer screen. The complexity of the GUI is systematically reduced by breaking the access process into a well-defined sequence of relatively straightforward steps. The first step is to retrieve selected parts of the screen image (areas of interest) from the video signals that drive the computer screen. Multi-modal navigation tools enable the user to quickly identify areas of interest on the screen and to tell the system how to process the selected information. The second step is to transform the recovered screen data into static representations analogous to scanned documents. The third step is to recover the information contained in each of the screen segments using image processing, pattern recognition and optical character recognition (OCR). The fourth and final step is to transform the recovered screen information into a form that is meaningful to the user. Synthesized speech, tactile (Braille), haptic, and sonic representations can be used individually or in a variety of combinations.
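The four steps above can be sketched as a chain of small functions. The following Python is a toy illustration only: a grid of character cells stands in for pixel data, the recognition step is a stand-in for real OCR, and all function names are hypothetical.

```python
def capture_region(screen, top, left, height, width):
    """Step 1: retrieve only the area of interest from the full screen."""
    return [row[left:left + width] for row in screen[top:top + height]]


def freeze(region):
    """Step 2: make a static copy, analogous to a scanned document."""
    return tuple(tuple(row) for row in region)


def recognize(snapshot):
    """Step 3: recover the content. A real system would apply image
    processing and OCR; this stand-in just joins character cells."""
    return ["".join(row).strip() for row in snapshot]


def present(lines):
    """Step 4: put the content in a form for the user (here, text that
    could be handed to a speech synthesizer)."""
    return " ".join(line for line in lines if line)


# A toy "screen" of character cells stands in for the video image.
screen = [list("File  Edit  View    "),
          list("                    "),
          list("  Hello, world      ")]

# Process only the third row, the area of interest chosen by the user.
spoken = present(recognize(freeze(capture_region(screen, 2, 0, 1, 20))))
```

Each stage consumes only the output of the previous one, so a different presentation (Braille, haptic, sonic) would replace `present` without touching the earlier steps.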

Steps one to four are shared between two components: the VideoTAP, which connects to the video output of the target computer, and a GUI accessor that is operated by the blind user. Quite different design constraints exist for each of these components. The VideoTAP is intended to be a ubiquitous device that is always available on any computer blind individuals may want to use. Consistent operation, a standard interface, and low cost are therefore critical design constraints for the VideoTAP. In contrast, the GUI accessor is a very personal device designed to match the needs, abilities and preferences of each user. Cost may be of secondary importance compared to having the ability to choose preferred methods for navigating and presenting screen information.

Achieving an effective operating speed is one of the major criteria for deciding which hardware and software functions are handled in the VideoTAP and which are handled by the GUI accessor. Many different sequential processes must be performed to translate information from the screen to the user. We are exploring a variety of strategies for reducing the time required to perform the individual processes. The overall strategy is to restrict the amount of screen information that is captured and analyzed at any instant to just the information necessary to perform the desired function. For example, locating windows on a screen requires knowledge about window boundaries but not the contents. Identifying the function of a window requires the additional step of recognizing the text contained in the title bar. Reading the text displayed within a selected window requires each of the preceding steps plus optical character recognition of the contents of the window and screen reading functions for navigating the text. The crucial feature of our approach is that we perform many small, strategically chosen operations very quickly, rather than the much slower approach of capturing and analyzing complete screens of data.
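The idea of locating a window from its boundary alone, and only then narrowing in on the title bar, can be illustrated with a short Python sketch. The bitmap encoding (border pixels marked with the value 1) and the function names are assumptions made for this example, not the detection method used in the project.

```python
def find_window_bounds(bitmap, border_value=1):
    """Return (top, left, bottom, right) of the pixels equal to border_value,
    or None if there are none. The window contents are not recognized."""
    rows = [y for y, row in enumerate(bitmap) if border_value in row]
    if not rows:
        return None
    cols = [x for row in bitmap for x, v in enumerate(row) if v == border_value]
    return rows[0], min(cols), rows[-1], max(cols)


def title_bar_region(bounds, bar_height=1):
    """Return the strip just inside the top border (inclusive coordinates),
    the only part that would be passed on for text recognition."""
    top, left, bottom, right = bounds
    return top + 1, left + 1, top + bar_height, right - 1


# Toy bitmap: 0 = desktop, 1 = window border, other values = window contents.
bitmap = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 7, 7, 1, 0],
    [0, 1, 5, 5, 1, 0],
    [0, 1, 1, 1, 1, 0],
]

bounds = find_window_bounds(bitmap)
title = title_bar_region(bounds)
```

Only the small `title` region would need character recognition to identify the window, which matches the strategy of performing several small operations rather than analyzing the whole screen.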

The data link between the VideoTAP and the GUI accessor is a critical part of the Total Access System. In reality, this link is a fully implemented network that must handle many different input and output transactions in real time. The overall system response will be faster if network traffic is kept to a minimum by performing most of the image analysis in the VideoTAP, thereby reducing the amount of data that must be sent to the GUI accessor. However, increasing the amount of processing performed in the VideoTAP would reduce the flexibility available to designers and users of the GUI accessors. Part of our project is to determine where the processing can be performed most cost-effectively.
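A rough calculation shows why the location of the image analysis affects link traffic. The Python below compares sending a raw bitmap of a region with sending only the recognized text; the region size, color depth, and sample title are invented for illustration and are not measurements from the prototype.

```python
def raw_region_bytes(width_px, height_px, bits_per_pixel=8):
    """Bytes needed to send an uncompressed bitmap of a screen region."""
    return width_px * height_px * bits_per_pixel // 8


def text_bytes(text):
    """Bytes needed to send already-recognized text as ASCII."""
    return len(text.encode("ascii"))


# Hypothetical title bar: 400 x 20 pixels at 8 bits per pixel.
bitmap_cost = raw_region_bytes(400, 20)
# The same title bar after recognition in the VideoTAP.
text_cost = text_bytes("Untitled - Notepad")

print(f"raw bitmap: {bitmap_cost} bytes, recognized text: {text_cost} bytes")
```

The reduction in traffic comes at the price noted above: once the VideoTAP has reduced a region to text, the GUI accessor can no longer apply its own analysis to the original pixels.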

We are currently implementing proof-of-concept prototypes of the VideoTAP and GUI accessor, using commercially available components wherever possible. Our initial goal is to demonstrate all of the necessary functions on a carefully designed selection of test images displayed on a particular screen. Almost all of the necessary software functions exist as components of various graphics and CAD programs. We plan on passing files from one application to another and using the native scripting language within each application to perform the necessary operations.

Prototyping the hardware will be much more involved than prototyping the software. There are two main problems to be solved in the hardware design. The first problem is that the screen capture hardware must extract very high frequency pixel data from the video signals. The second problem is that the system must automatically accommodate the very wide range of vertical and horizontal scan rates encountered with the different video modes on the different computer platforms. While it is possible to solve these problems in a very direct manner using existing graphics controller chips and processing techniques, the resulting hardware would be too expensive for the planned application. By recognizing some of the special characteristics of the screen reading operation, we have devised a low-cost strategy for recovering the necessary data from the video signals. This low-cost version is still quite complex, however, so we plan to implement an even simpler strategy for the first hardware prototype.
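To give a sense of the frequencies involved, the pixel rate of a video mode is the product of the total pixels per line, the total lines per frame, and the refresh rate. The Python sketch below applies this to the standard 640 x 480 VGA timing (800 total pixels per line, 525 total lines, about 59.94 Hz); these timing figures are supplied here as an example and do not come from the project description.

```python
def horizontal_scan_rate_hz(v_total_lines, refresh_hz):
    """Lines drawn per second, including blanking lines."""
    return v_total_lines * refresh_hz


def pixel_clock_hz(h_total_pixels, v_total_lines, refresh_hz):
    """Pixels clocked per second, including blanking intervals."""
    return h_total_pixels * horizontal_scan_rate_hz(v_total_lines, refresh_hz)


# Standard 640 x 480 VGA timing, used here only as a familiar example.
line_rate = horizontal_scan_rate_hz(525, 59.94)   # about 31.47 kHz
pixel_rate = pixel_clock_hz(800, 525, 59.94)      # about 25.17 MHz

print(f"{line_rate / 1e3:.2f} kHz line rate, {pixel_rate / 1e6:.2f} MHz pixel clock")
```

Higher resolutions and refresh rates raise both figures, which is why capture hardware that must follow every mode on every platform becomes demanding.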

Our first hardware prototype of the VideoTAP consists of three items: 1) a commercially available computer-to-TV signal converter that is connected to the screen feed on the target computer; 2) a commercially available frame grabber that converts the TV signals to a bitmap; and 3) a personal computer that processes the recovered bitmap data and emulates the operation of the VideoTAP. The final version of the VideoTAP will probably be designed around a fast digital signal processor.

The hardware prototype of the GUI accessor uses two more personal computers. The first computer applies OCR to text that is made available to the blind user through a conventional screen reader and speech synthesizer. The second computer extracts graphical information from the data produced by the VideoTAP and generates force vectors to drive a haptic output device. The final version of the GUI accessor will probably use a notebook computer or a device like a Braille 'n Speak as the primary accessor and an embedded DSP and/or microprocessor to drive the haptic display. At a later stage, we plan on adding a third computer to the GUI accessor that will derive sonic information from the data produced by the VideoTAP and present it to the blind user.
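The paper does not specify how the force vectors are computed. As one hypothetical possibility, the Python sketch below uses a simple spring model that pulls the haptic pointer toward a target on the screen (for example, the center of a button) and limits the force magnitude; the model, the parameter values, and the function name are all assumptions.

```python
import math


def attraction_force(cursor, target, stiffness=0.5, max_force=2.0):
    """Return an (fx, fy) force pulling the cursor toward the target.

    Assumed model: force proportional to displacement (a spring),
    scaled down so its magnitude never exceeds max_force.
    """
    fx = stiffness * (target[0] - cursor[0])
    fy = stiffness * (target[1] - cursor[1])
    magnitude = math.hypot(fx, fy)
    if magnitude > max_force:
        scale = max_force / magnitude
        fx, fy = fx * scale, fy * scale
    return fx, fy


# Pointer at the origin, target 3 units right and 4 units down.
fx, fy = attraction_force((0, 0), (3, 4))
```

With these assumed parameters the unclamped spring force would have magnitude 2.5, so it is scaled back to the 2.0 limit while keeping its direction toward the target.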

The VideoTAP will always be totally transparent to the user. It will simply monitor the video output signals from the target computer and send selected screen data to any accessors that are connected to the Total Access Link. The speech and haptic components of the GUI accessor can be used in concert to provide the user with overlapping alternatives for sensing the information displayed on the screen of the target computer.