
Web Posted on: August 24, 1998


 

An Open System Architecture for a Multimedia and Multimodal User Interface


Dr. Jiang Shao
Mr. Nour-Eddine Tazine
Dr. Lori Lamel
Dr. Bernard Prouts
Mr. Sven Schroter

* THOMSON multimedia R & D France, BP 19, 35511 Cesson-Sévigné Cedex, France
phone: +33 (02) 99 27 32 30, fax: +33 (0)2 99 27 30 01, e-mail: shaoj@thmulti.com

Introduction

Future domestic networks will link together many consumer devices. This trend will succeed only if users can easily control the linked devices. The HOME-AOM (DE 3003) project intends to develop an integrated intelligent multimodal and multimedia user interface providing elderly and disabled users with a consistent, natural, intuitive and user-friendly means of controlling home appliances remotely and by teleoperation via a mobile phone.

This paper describes the open system architecture which will be used to implement the HOME-AOM integrated multimodal and multimedia user interface. The design fulfills requirements for high reliability, high safety and security, and a high degree of adaptability to particular user needs and/or domestic environments. Multimodal dialogue is a new area, and there are few published results on multimodal dialogue management. In this paper we also describe the information data flow strategy developed for management of the full multimodal dialogue of the HOME-AOM system.




Overview of the HOME-AOM System

In the HOME-AOM system, user control is based on natural and intuitive dialogues allowing simultaneous use of three modalities: touch, hand gestures and voice. System feedback is provided in text, graphics, sound and speech. The application environment covers not only operation of traditional home devices (TV, VCR, washing machine, oven, heater, lights, etc.) but also emerging technology and services, such as digital interactive TV. A total of nine home appliances and three bus protocols need to be controlled.

Touch Sensitive Control and Graphics Display

Touch sensitive control is a natural input modality, which allows the user to point at (select) and control devices directly with a finger. Used with high-resolution graphics, it can make users feel that they are in front of a real control panel of familiar home appliances, and so help them, especially elderly and disabled (E & D) users, to overcome their fear of computers.

Natural Language Processing

Natural spoken language understanding consists of interpreting a spoken query so as to extract its meaning, allowing a user to obtain information or carry out a task. The user is free to formulate a query or a command in a natural manner, and is not required to observe a pre-defined syntax. The HOME-AOM natural spoken language components include a speaker-independent, continuous speech recognizer, a natural language understanding component, and a response generation component.

Isolated Word Recognition

Basically, the system is designed to recognise isolated words (or short phrases) which correspond to speech commands. The system is speaker dependent and language independent. Each user may define his own vocabulary should he so desire.

Sound and Speech Synthesis

The use of sound and speech, in addition to text and graphics, enhances system message feedback and makes the user-machine dialogue more natural, especially when combined with natural spoken language input. The HOME-AOM speech response will make use of pre-recorded speech for all fixed messages that are unlikely to change, and of concatenation of speech units for variable information. Speech concatenation makes use of pre-recorded speech units that are stored in a dictionary. Responses in the form of a text string can be generated automatically, and the text string is used to locate the appropriate dictionary units for concatenation.
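
As a rough illustration of this concatenation scheme, a generated response string could be mapped onto pre-recorded units as sketched below. The unit names, the dictionary layout and the use of plain sample lists are illustrative assumptions, not the project's actual implementation.

    # Minimal sketch of speech concatenation: a generated text response is split
    # into words, each word is looked up in a dictionary of pre-recorded units,
    # and the corresponding audio segments are joined for playback.
    # Unit names and the audio representation (lists of samples) are assumptions.

    from typing import Dict, List

    # Hypothetical dictionary: word -> pre-recorded audio samples.
    SPEECH_UNITS: Dict[str, List[int]] = {
        "the": [0, 1, 2],
        "tv": [3, 4],
        "is": [5],
        "off": [6, 7],
    }

    def synthesise_response(text: str) -> List[int]:
        """Concatenate the dictionary units that cover the response text."""
        samples: List[int] = []
        for word in text.lower().split():
            unit = SPEECH_UNITS.get(word)
            if unit is None:
                # A real system would fall back to another strategy for an
                # unknown word; here we simply skip it.
                continue
            samples.extend(unit)
        return samples

    print(synthesise_response("The TV is off"))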

Gesture Recognition

It is intuitive, natural and, in some cases, quicker for the user to control home devices by hand gestures. In the HOME-AOM system, two gesture recognition systems will be used: a 1-camera system working in the 2D plane and a 3-camera system working in 3D space. The 1-camera system recognises only semantic gestures, while the 3-camera system also recognises deictic gestures, giving the user the possibility to select a device for control.

Teleoperation

Teleoperation offers mobility to users without losing safety and comfort. Moreover, it gives E & D users the possibility to obtain external assistance if needed, e.g., from a care centre. The teleoperation system is composed of remote equipment and local equipment. The remote equipment is a mobile phone with a graphics display. The local equipment is a GSM transceiver installed inside the HOME-AOM central unit.

 




System Architecture


Figure 1: HOME-AOM System Architecture

In order to integrate all these media, modalities, bus protocols and devices into a reliable, safe and secure HOME-AOM system, while keeping the system adaptable and open, we have designed the system architecture shown in Figure 1. This architecture has the following characteristics:

  • Autonomous modules: Each module of the system is responsible for a well-determined task, and manages a specific input/output medium/modality in an autonomous way.
  • Distributed information and co-operation: Each module maintains some specific system status information; taken together, this information determines the overall status of the system and decides the action of the system whenever an event occurs.
  • Asynchronous event-driven communication: Communication between modules is realised by means of asynchronous event messages, allowing the user to have instantaneous access to and control of the system (a sketch of this scheme is given after this list).
  • High-level, user-oriented module interfaces: High-level, user-oriented APIs are designed for the module interfaces in order to keep the system open to changes and/or additions of equipment, buses and devices.
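
A minimal sketch of this asynchronous event-message scheme is given below. The event names, module names, queue-based mailboxes and the simplified routing logic are illustrative assumptions; they are not the actual HOME-AOM interfaces.

    # Minimal sketch of asynchronous, event-driven co-operation between
    # autonomous modules. Each module owns a mailbox (queue); modules never
    # call each other directly, they only post event messages.
    # Event names and module names are illustrative assumptions.

    import queue
    from dataclasses import dataclass, field

    @dataclass
    class Event:
        name: str                                  # e.g. "TV_OFF"
        source: str                                # module that emitted the event
        payload: dict = field(default_factory=dict)

    class Module:
        def __init__(self, name: str):
            self.name = name
            self.mailbox: "queue.Queue[Event]" = queue.Queue()

        def post(self, event: Event) -> None:
            """Asynchronously deliver an event message to this module."""
            self.mailbox.put(event)

    class FeatureEngine(Module):
        """Kernel module: receives all events and dispatches them to other modules."""

        def __init__(self, modules: dict):
            super().__init__("FeatureEngine")
            self.modules = modules                       # name -> Module
            self.status = {"selected_device": None}      # part of the system status

        def handle(self, event: Event) -> None:
            # Route according to the current system status (much simplified).
            if event.name == "DEVICE_SELECTED":
                self.status["selected_device"] = event.payload["device"]
            elif event.name == "TV_OFF":
                self.modules["BusManager"].post(event)

    # Example: the Screen Manager reports a touch command to the Feature Engine,
    # which forwards it to the Bus Manager.
    bus_manager = Module("BusManager")
    engine = FeatureEngine({"BusManager": bus_manager})
    engine.post(Event("TV_OFF", source="ScreenManager"))
    engine.handle(engine.mailbox.get())
    print(bus_manager.mailbox.get().name)                # -> TV_OFF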

The role and co-operation of the modules composing the HOME-AOM system are discussed in the following.

Feature Engine

The Feature Engine is the kernel module, which implements the intelligence of the system. It is responsible for the overall functionality of the system, and drives the system. It is aware of the complete status of the system. As the kernel module of the system, the Feature Engine receives (or catches) all event messages sent by the other modules. In accordance with the system status, it sends or dispatches event messages to the other modules.

Bus manager

The Bus Manager handles the communication between the Feature Engine and the different buses, and drives the devices controlled by the HOME-AOM system. To the Feature Engine, it presents a homogeneous interface for all the devices even if they use heterogeneous bus protocols. It translates high-level bus commands received from the Feature Engine into low-level bus commands. It controls the buses and devices, and maintains their status information.
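
The sketch below illustrates this kind of command translation. The bus names, device addresses and byte encodings are purely illustrative assumptions; they do not correspond to the actual bus protocols handled by the project.

    # Minimal sketch of a Bus Manager mapping high-level device commands onto
    # heterogeneous bus protocols. Bus names and frame formats are assumptions.

    # High-level command -> (bus, low-level frame) routing table.
    COMMAND_TABLE = {
        "TV_OFF":   ("bus_a", bytes([0x10, 0x00])),
        "OVEN_OFF": ("bus_b", b"OVEN:OFF"),
    }

    def send_on_bus(bus: str, frame: bytes) -> None:
        # Stand-in for the real bus drivers.
        print(f"{bus} <- {frame!r}")

    def execute(command: str) -> None:
        """Translate a high-level command into a bus-specific frame and send it."""
        bus, frame = COMMAND_TABLE[command]
        send_on_bus(bus, frame)

    execute("TV_OFF")    # -> bus_a <- b'\x10\x00'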

Screen Manager

The Screen Manager is an input and output module, which handles local remote control with touch-sensitive input and graphics display. It manages the user's touch actions, translates these actions into user interaction events, and sends these events to the Feature Engine. The Screen Manager is composed of intelligent widgets, in the sense that a single user interaction event may represent one touch action or a well-defined sequence of touch actions (a minimal sketch of such a widget is given below). The Screen Manager reflects the state of navigation inside the HOME-AOM user interface, whether the navigation is made directly through the touch screen or provoked by other modalities.
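
As a rough illustration of such an intelligent widget (the widget type, the touch-action names and the emitted event are all assumptions, not the actual Screen Manager design), a widget might accumulate a defined sequence of touch actions and report a single interaction event only when the sequence is complete:

    # Minimal sketch of an "intelligent widget": it consumes individual touch
    # actions and emits one user interaction event only when a well-defined
    # sequence has been completed. Action and event names are assumptions.

    class SliderWidget:
        """Hypothetical widget: press, drag ... drag, release -> one SET_LEVEL event."""

        def __init__(self):
            self.level = None

        def touch(self, action: str, value: int = 0):
            if action in ("press", "drag"):
                self.level = value
                return None                         # sequence not finished yet
            if action == "release":
                completed = {"event": "SET_LEVEL", "level": self.level}
                self.level = None
                return completed                    # one event for the whole sequence
            return None

    widget = SliderWidget()
    for action, value in [("press", 2), ("drag", 5), ("drag", 7), ("release", 0)]:
        result = widget.touch(action, value)
        if result:
            print(result)        # -> {'event': 'SET_LEVEL', 'level': 7}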

Teleoperation Management Unit

The Teleoperation Management Unit is another input and output module, which handles out-of-home remote control with keyboard or mouse input and graphics display. At the conceptual level, the Teleoperation Management Unit works in a similar way to the Screen Manager. It manages the user's keyboard or mouse actions, translates them into user interaction events, and sends these events to the Feature Engine. It is also composed of intelligent widgets which can autonomously handle some well-defined sequences of user actions. The Teleoperation Management Unit reflects the state of navigation inside the HOME-AOM user interface.

Gesture Recognizer

The Gesture Recognizer is an input module, which handles gesture command input. This module translates the user's hand gestures or gesture sequences into user interaction events and sends these events to the Feature Engine. The Gesture Recognizer also manages 3D space information: it can track the user's movement in the real space and translate it into room navigation events.

Isolated Speech Recognizer

The Isolated Speech Recognizer is an input module, which handles simple vocal command input. It translates the user's speech utterances into user interaction events (vocal commands), and sends these events to the Feature Engine.

Natural Language Processor

The Natural Language Processor is an input module, which handles spoken language input. It recognises the user's speech utterances, understands their meaning, translates them into user interaction events, and sends these events, in the form of semantic frames, to the Feature Engine.

Speech & Sound Synthesiser

The Speech & Sound Synthesiser is an output module, which handles audio feedback. It receives speech or sound synthesis commands, and plays the synthesised audio signal either through the loudspeaker of the local sound system, or through the mobile phone in the case of teleoperation.




Information Data Flow for Management of Multimodal Dialogue

We can distinguish two types of commands in accordance with user interactions:

  • Complete commands are user interaction events which correspond directly to one (or a set of) well-identified system action(s) without the need for any further information. < TV_OFF >, corresponding to turning off the TV set, is an example of a complete command. The treatment of complete commands is simple and direct; they do not require further dialogue.
  • Incomplete commands are user interaction events which do not correspond directly to a system action. Further information is needed to determine the system action(s) to be carried out. These commands usually issue from the dialogue between user and system, and their treatment may engage further dialogue. Incomplete commands are typically user interaction events coming from the NLP; some gesture or vocal commands, like the command (gesture or speech) < ON >, may also be incomplete.

We can examine how the HOME-AOM system manages incomplete commands with the example of a user saying "Turn off this device" in natural spoken language. In this case, a semantic frame like < OFF > | < THIS_DEVICE > will be sent by the NLP. It is an incomplete command, as the meaning of < THIS_DEVICE > needs to be clarified. When the Feature Engine receives this semantic frame, there may be two situations (a sketch of this resolution flow follows the list below):

  1. It knows which device < THIS_DEVICE > is, say the TV, e.g., because it is the currently selected device. In this case, it can directly send the corresponding command, < TV_OFF >, to the Bus Manager, which handles the bus for actually switching off the TV set.
  2. The Feature Engine does not know which device corresponds to < THIS_DEVICE >. In this case, it will first send a message to the Screen Manager, Teleoperation Management Unit and Gesture Recognizer, inquiring whether the user has recently selected a device through these modalities.
    • If one of the requested modules provides the identification of a device that the user has just or already selected, say < TV >, then the Feature Engine can send the corresponding command, < TV_OFF >, to the Bus Manager;
    • If no reply is received, i.e., after a time-out delay, the Feature Engine will send a dialogue message, like < Which device do you want to turn off? >, to the Speech & Sound Synthesiser to be spoken and to the Screen Manager and Teleoperation Management Unit to be displayed on their graphics screens. The user can then select the device that he wants to turn off either by speech through the NLP, by touch through the Screen Manager, by key action through the Teleoperation Management Unit or by hand gesture through the Gesture Recognizer. When the expected user selection event message arrives, e.g., < TV >, the Feature Engine can send the corresponding command, < TV_OFF >, to the Bus Manager.
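
The following sketch condenses this resolution flow. The frame representation, the modality query functions and the way the time-out and clarification dialogue are modelled are illustrative assumptions rather than the project's actual implementation.

    # Minimal sketch of resolving an incomplete command such as
    # < OFF > | < THIS_DEVICE >. Modality queries return None when no device
    # selection is reported (i.e. after a time-out); a clarification dialogue
    # is opened only as a last resort. All names are assumptions.

    from typing import Callable, List, Optional

    def resolve_device(selected_device: Optional[str],
                       modality_queries: List[Callable[[], Optional[str]]],
                       ask_user: Callable[[str], str]) -> str:
        """Return the device that < THIS_DEVICE > refers to."""
        # 1. The Feature Engine already knows the currently selected device.
        if selected_device is not None:
            return selected_device
        # 2. Ask the Screen Manager, Teleoperation Management Unit and
        #    Gesture Recognizer whether the user recently selected a device.
        for query in modality_queries:
            device = query()
            if device is not None:
                return device
        # 3. Otherwise open a clarification dialogue on all output modalities.
        return ask_user("Which device do you want to turn off?")

    def handle_off_frame(device: str) -> str:
        return f"{device}_OFF"               # complete command for the Bus Manager

    # Example: no device is currently selected; the Screen Manager reports "TV".
    command = handle_off_frame(resolve_device(
        selected_device=None,
        modality_queries=[lambda: None, lambda: "TV"],
        ask_user=lambda prompt: "TV",
    ))
    print(command)                           # -> TV_OFF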




Conclusion

We have presented an open system architecture for implementing the HOME-AOM integrated intelligent multimodal and multimedia user interface. This architecture fulfills the requirements of high reliability, high safety and security, and a high degree of adaptability to particular user needs and/or domestic environments. An information data flow strategy allowing the management of multimodal dialogue has been proposed.



