Web Posted on: February 16, 1998


A READING SYSTEM FOR DIGITAL RECORDED SPEECH

Aurelien Tisne, Bernard Oriola, Nadine Vigouroux & Philippe Truillet
IRIT UMR CNRS 5505
118, Route de Narbonne
31062 TOULOUSE - France
Voice/TDD/Message: (33) 561 556 314
FAX: (33) 561 556 258
Internet: {tisne, oriola, vigourou, truillet}@irit.fr

I. INTRODUCTION

The aim of this paper is to present the interaction concepts used in the new generation of talking books. The goal of these talking books is to let you access and read structured digital audio documents. The development of efficient talking books raises two important questions:

  • 1) How should the narrator read the printed document? Is there a dictation strategy for the recording, with instructions on matters such as the length of the pause between two sentences or two paragraphs, the way to enumerate a list or to read a table, and so on?
  • 2) What are the appropriate levels of representation for the audio document? In the work presented here, the documents are recorded by human narrators and structured manually, so that the Talking Reader System offers the same access functions that a sighted reader has with a printed document.

Consequently, this paper describes neither the format nor the structure of audio documents. First, a typology of audio documents is presented (we use the terms document and book interchangeably to refer to digital speech recordings). Second, the functions designed from the analysis of users' needs for reading speech recordings at several levels are described. Finally, user interaction commands through the keyboard and future directions are discussed.

I.1. The motivations

You may already have faced the problem of handling an audiocassette. Because analog recording is temporal in nature, it is not easy to locate a precise item in an audio document quickly without listening sequentially. This becomes even more significant when the document is a working book that needs a lot of handling, for instance operating instructions, user's manuals, textbooks, technical reports and legal texts. For these categories of book, readers are usually interested only in a precise part. Therefore, reading functions must be available to locate and reach the target piece of information. Analog media such as tapes impose sequential access to data. Moves are thus slow and restrictive. Digital technology, by contrast, allows direct access to any part of the file, so access time is significantly reduced. This feature will make the new generation of interactive consulting systems more powerful than their predecessors. For instance, users could switch instantly between the table of contents at the beginning of the document and the index at the end. However, the interactive system must provide advanced moving functions to make searching for and consulting the relevant information easier.

The shift from analog to digital representation of speech offers good opportunities to design efficient audio browsing of recorded speech documents. The design of IRIT's Talking Reader System was based on the analysis of blind users' needs reported by the Expert Working Group on The Next Generation of Talking Book [1] and on the results of our questionnaire [2].

I.2 Which consultation strategy is necessary?

Depending on the nature of the item sought and the structure of the whole document, the reader (or rather the listener) adapts his/her consultation strategy. He/she does not use the same method to look up all types of document. The distinctive characteristics of each category imply different ways of browsing (or consulting). To account for this, we have identified four document categories. For each of them, the structure and the consultation strategy are presented.

First category (noted Cat. I): The information forms a single whole, and the chronological order is important. Examples include novels, biographies, essays, etc. Strategy of consultation: The document is usually read in full and linearly: we start reading at the beginning and finish at the end. There are very few moves backward; when they occur, they are mainly local. The structure of the document is not very meaningful. Advanced access functions do not seem essential. The existing basic functions should be sufficient.

Second category (noted Cat. II): The document is made up of fragmented, structured but independent pieces of information. Examples include technical reports, user's manuals, anthologies, etc.

Strategy of consultation: Reading is linear and complete, but only for a localized part. In fact, consultation takes place in two stages: first, we seek the information (often using the table of contents and the index). Then, once the relevant piece of information is reached, it is generally read in full. The document structure plays a substantial part. The hypertext concept is perfectly suited to this kind of document, since a precise part of the text has to be reached quickly.

Third category (noted Cat. III): Items of information are independent and often short-lived. Examples of this category include newspapers, articles, mail, announcements, etc. Strategy of consultation: The reader skims through the document (header or summary, keyword area, article body, images, captions, etc.). Because there are several levels of interest (often a quick, global overview of the content is enough to decide what to do), the management of these articles, messages, etc. within newspapers, mailboxes and folders of mailboxes is also important.

Fourth category (noted Cat. IV): The document is made up of entries that are independent of each other. Examples include dictionaries, lists, yearbooks, etc. Depending on the type of book, the underlying organization can differ: database, hypermedia, simple text files. Strategy of consultation: Access functions may conveniently be coupled to a search tool suited to the model of data representation. The item is read in full.

II. Consultation functions

This set of concepts (functions) derives from the tape recorder metaphor and from users' needs. For each function, we indicate its appropriateness with regard to the four categories of documents. Some functions are generic: they can be adapted to the granularity of the book's structural representation.

II.1. Basic moving functions

This set groups together the basic functions for moving within speech recordings. All existing analog devices provide these controls: fast forward, play, stop, pause and fast backward. These basic functions are highly effective for all categories. They form an essential kernel of functions for navigating through recorded speech documents at any hierarchical level.

II.2. Advanced moving functions

These functions allow more precise and rapid moves than the previous ones. They make it possible to jump from the beginning of one data unit to the beginning of the next. A data unit may be a phrase, a paragraph, a chapter, etc., depending on the structure and type (database, hypermedia, file, mailbox, etc.) of the document representation. For a speech document, it seems that the access unit should not be smaller than the phrase. Local moves (inside a phrase) should be handled by the fast forward and fast backward controls described above. These functions are helpful for all categories.
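As an illustration of unit-level jumps, here is a minimal sketch in Python. It assumes that the manual structuring yields, for each level, a sorted list of unit start times in seconds; the `UNIT_STARTS` table, its values and the function names are hypothetical and are not taken from the Talking Reader System.

```python
from bisect import bisect_left, bisect_right

# Hypothetical index: sorted start times (seconds) of the units at each level.
UNIT_STARTS = {
    "chapter": [0.0, 120.0, 300.0],
    "paragraph": [0.0, 40.0, 120.0, 180.0, 300.0, 350.0],
    "phrase": [0.0, 12.5, 40.0, 55.0, 120.0, 150.0, 180.0, 300.0, 320.0, 350.0],
}


def next_unit(position, level):
    """Return the start of the first unit beginning after `position`, or None."""
    starts = UNIT_STARTS[level]
    i = bisect_right(starts, position)
    return starts[i] if i < len(starts) else None


def previous_unit(position, level):
    """Return the start of the last unit beginning before `position`, or None."""
    starts = UNIT_STARTS[level]
    i = bisect_left(starts, position)
    return starts[i - 1] if i > 0 else None
```

With the reading cursor at 45 seconds, `next_unit(45.0, "phrase")` gives 55.0 while `next_unit(45.0, "paragraph")` gives 120.0, so the same command adapts to the chosen level. A binary search keeps each jump fast even for long recordings.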

II.3. Locating functions

Their aim is to provide, at any time, the current position within the document. This could be done using a spatial or temporal clue. This reading facility could supply the elapsed time, a proportion of the size of the document (a percentage), an indication according to the document structure (the reading cursor is located at sentence 3, paragraph 2, chapter 1), and so on. The page concept, in our opinion, is of little interest for an electronic document, except when a sighted and a blind person work together on both the electronic and the paper media. These functions are useful when the document (mainly Cat. I and III) is large and well structured.
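The three kinds of clue mentioned above (elapsed time, percentage, structural position) can be sketched as follows. This is only an illustration: it assumes nested levels whose units share start boundaries, with the first unit of each level starting at 0, and the `locate` function and its data layout are hypothetical.

```python
from bisect import bisect_left, bisect_right


def locate(position, duration, levels):
    """Describe `position` (seconds) as elapsed time, percentage and unit numbers.

    `levels` lists (name, sorted start times) pairs, outermost level first.
    Unit numbers are 1-based and counted inside the enclosing unit.
    """
    report = {
        "elapsed_s": position,
        "percent": round(100.0 * position / duration, 1),
    }
    parent_start = 0.0
    for name, starts in levels:
        first = bisect_left(starts, parent_start)     # first unit of the enclosing unit
        current = bisect_right(starts, position) - 1  # unit containing the position
        report[name] = current - first + 1
        parent_start = starts[current]
    return report


# Hypothetical structure of a 400-second recording.
LEVELS = [
    ("chapter", [0.0, 120.0, 300.0]),
    ("paragraph", [0.0, 40.0, 120.0, 180.0, 300.0, 350.0]),
    ("sentence", [0.0, 12.5, 40.0, 55.0, 120.0, 150.0, 180.0, 300.0, 320.0, 350.0]),
]
```

For example, `locate(150.0, 400.0, LEVELS)` reports 37.5 percent, chapter 2, paragraph 1, sentence 2. A real system would speak this report rather than return it.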

II.4. Searching function

This is the same concept as the well-known string search within text. But searching within a speech document is not as easy as within text: it requires word-spotting speech recognition algorithms that work in real time. Progress in automatic speech recognition makes this conceivable. This function is not yet available in the Talking Reader System but is planned for the future. The search function seems appropriate for Cat. II and IV.

II.5. Hyperlink concept

This is the transposition of the hypertext concept to audio documents. The mechanism establishes a link between two parts of a spoken recording. Beforehand, the author places a button (also called an anchor) at a strategic point in the audio document. When you activate the button while listening to the document, reading continues at the linked part. A link is generally established between two related ideas and so allows look-up by association of ideas. The hyperlink concept is very helpful for Cat. II, III and IV.
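One possible way to represent such anchors is sketched below in Python. It assumes that an anchor covers a time interval of the recording and points to a target time; the `Anchor` type and the helper functions are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    start: float   # beginning of the anchor area, in seconds
    end: float     # end of the anchor area (exclusive), in seconds
    target: float  # position where reading continues if the link is activated


def active_anchor(anchors, position):
    """Return the anchor whose area contains `position`, or None."""
    for anchor in anchors:
        if anchor.start <= position < anchor.end:
            return anchor
    return None


def follow_link(anchors, position):
    """Return the new reading position after the user activates the button.

    Outside any anchor area the position is left unchanged.
    """
    anchor = active_anchor(anchors, position)
    return anchor.target if anchor is not None else position


# Hypothetical anchors placed by the author of the document.
ANCHORS = [Anchor(30.0, 35.0, 200.0), Anchor(90.0, 95.0, 10.0)]
```

Activating the button at 32 seconds moves the reading to 200 seconds, whereas activating it at 50 seconds has no effect because no anchor is present there.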

II.6. Global overview description of the speech

At first glance, it is very useful for the reader to be able to obtain some general information about the document being read. This function may provide: the author's name, the size of the document, the date of writing, the summary, the list of keywords, the description of the table of contents, the number of representation levels, etc. This function is important for all categories.

II.7. Marking Concept

The user can place a set of bookmarks throughout the document. The bookmark list then offers direct access to a given part of the book. A specific bookmark indicates the location of the last item of information read. These bookmarks are helpful for Cat. I, II and III.
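The marking concept can be sketched as a small table of named positions, with one reserved entry for the last item read. The class below is a hypothetical illustration in Python; the reserved name `last-read` and the method names are assumptions, not part of the described system.

```python
LAST_READ = "last-read"  # hypothetical reserved name for the automatic bookmark


class Bookmarks:
    """User-placed bookmarks, stored as name -> position in seconds."""

    def __init__(self):
        self._marks = {}

    def add(self, name, position):
        """Place (or move) a bookmark at `position`."""
        self._marks[name] = position

    def remember_last_read(self, position):
        """Update the special bookmark when reading stops."""
        self._marks[LAST_READ] = position

    def position_of(self, name):
        """Return the position of a bookmark, or None if it does not exist."""
        return self._marks.get(name)

    def listing(self):
        """Return (name, position) pairs in document order."""
        return sorted(self._marks.items(), key=lambda item: item[1])
```

Sorting the list by position lets the user step through the bookmarks in the order in which they occur in the recording.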

All the functions described above must satisfy ergonomic criteria (appropriate keyboard commands and efficient feedback) to be usable by visually impaired users. Moreover, the emerging generation of audio-reading systems must allow the user:

  • To switch between functions very easily;
  • To pause and to resume easily at any time while reading;
  • To set playback parameters such as the sound level, the pitch, etc. (With practice, users want to speed up the delivery for greater efficiency.)
  • To easily change the level of detail of the book representation for some functions (for instance, functions II.3 and II.6).

The availability of the full set of functions allows the user to adapt his/her consultation strategy to the nature of the document.

III. The Talking Reader System

III.1 Technical characteristics

This system has been implemented in object-oriented C on a PC-compatible multimedia computer running the Windows operating system. Audio resources are managed through the MCI (Media Control Interface). MCI is a high-level interface for controlling multimedia devices and resource files. It provides hardware-independent instructions. This program does not claim to be complete, still less exhaustive. It is intended as a prototype for assessing, and possibly validating, the concepts laid out above.

III.2 User interaction

The Talking Reader System has been designed especially for visually impaired people. Interaction must not impose a heavy cognitive load.

  • Input Interaction: We have chosen keyboard-based input. Most functions are directly accessible with a single keystroke. The most frequent actions are associated with easily accessible keys. For instance, you can press the spacebar to start or stop reading, and you move using the arrow keys. These keys provide convenient shortcuts for menu entries.
  • Output Interaction: Sound is also used in two modes: notification and feedback. In notification mode, it informs the user that an audio anchor is present, that the document structure level is changing, that the end of the document has been reached, and so forth. As feedback, it is very important to signal the success or failure of the user's action: for example, that the change of structure level has succeeded. This feedback can be turned off according to the user's preference. We have tried, as far as possible, to use auditory icons in the sense of Gaver: "...[Auditory icons] are everyday sounds mapped to computer events by analogy with everyday sound-producing events. They are not designed merely to provide entertainment; rather they convey information about events in computer systems, allowing us to listen to computers as we do to the everyday world."[3].
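The keyboard input and sound feedback described in the list above can be sketched as a key table that maps each keystroke to an action, with every action recording the sound that would be played. This is a hypothetical Python illustration: the assignment of the left and right arrows, the sound names and the `Reader` class are assumptions, and a real implementation would play sounds instead of storing their names.

```python
class Reader:
    """Minimal reading state: the current unit and whether playback is running."""

    def __init__(self, unit_starts):
        self.unit_starts = unit_starts  # sorted start times of the units, in seconds
        self.index = 0
        self.playing = False
        self.sounds = []  # names of the sounds that would be played

    @property
    def position(self):
        return self.unit_starts[self.index]

    def toggle_play(self):
        self.playing = not self.playing
        self.sounds.append("play" if self.playing else "pause")

    def next_unit(self):
        if self.index + 1 < len(self.unit_starts):
            self.index += 1
            self.sounds.append("moved")
        else:
            self.sounds.append("end-of-document")  # notification, no move

    def previous_unit(self):
        if self.index > 0:
            self.index -= 1
            self.sounds.append("moved")
        else:
            self.sounds.append("start-of-document")


# Assumed key assignment: one keystroke per frequent action.
KEYMAP = {
    "space": Reader.toggle_play,
    "right": Reader.next_unit,
    "left": Reader.previous_unit,
}


def handle_key(reader, key):
    """Run the action bound to `key`; signal a failure for unknown keys."""
    action = KEYMAP.get(key)
    if action is None:
        reader.sounds.append("failure")
        return False
    action(reader)
    return True
```

Keeping the bindings in a single table makes it easy to remap keys for a given user without touching the actions themselves.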

III.3. Consulting Strategies

From the purely linear consultation offered by present tools, the Talking Reader System moves toward a way of consulting close to that of printed documents. With printed documents, a sighted reader uses the table of contents and the index to locate the relevant information. Within the text, he/she skims to find what he/she is looking for. In anticipation of future readings, he/she can place a bookmark to return easily to the page.

The Talking Reader System aims to offer these same consulting strategies for talking books by means of the functions listed above. Blind users can reach the table of contents or the index (if they exist) directly at any time to search for information. While reading, visually impaired users can choose a unit level according to their search strategy (chapter, paragraph, sentence).

The hyperlink concept represents progress compared with consulting paper. It is a powerful function that facilitates the collection of information related to a topic. The Talking Reader System has a hyperaudio function. It works as follows: when you encounter an audio link, the user interface informs you by playing a characteristic sound. You can then activate the link by pressing a key, and reading continues at a place whose semantic content is linked with the previous item. The markers presented above are very close to the hyperlink concept. The main difference is that hyperaudio anchors are fixed throughout the text by the author, whereas markers are placed by the user. A hyperaudio link provides pre-established links; a marker provides the user's own links.

III.4 Future developments

One of the main problems for the new generation of talking books is to produce the document structure quickly and automatically (or semi-automatically) using speech processing algorithms. Two main complementary directions are planned:

  • The use of speech segmentation functions based on prosodic parameters such as pause duration, pitch variation, etc. One method would consist in using the intrinsic pauses of the reading to segment it into phrases.
  • The use of a word-spotting algorithm to find a word string in a speech recording in French. As a first stage, we prefer to facilitate direct access rather than semantic access.

The second improvement we can bring to our talking book format is a better encoding of the speech. We should analyze the options of the DAISY format (both encoding and representation format).
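The pause-based segmentation mentioned in the first direction above could look like the following sketch. It is not the planned algorithm: it assumes that a pause can be detected as a run of low-energy frames, and the frame length, amplitude threshold and minimum pause duration are arbitrary illustrative values.

```python
def segment_by_pauses(samples, rate, frame_ms=20, threshold=0.02, min_pause_ms=300):
    """Split a recording into (start_s, end_s) stretches separated by long pauses.

    `samples` holds amplitudes in [-1, 1] and `rate` is the number of samples
    per second. A frame counts as speech when its mean absolute amplitude
    exceeds `threshold`; pauses shorter than `min_pause_ms` do not split.
    """
    frame = max(1, rate * frame_ms // 1000)         # samples per frame
    min_pause = max(1, min_pause_ms // frame_ms)    # frames per splitting pause
    voiced = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        voiced.append(sum(abs(x) for x in chunk) / len(chunk) > threshold)

    segments = []
    start = None      # index of the first frame of the current stretch
    silent_run = 0    # number of consecutive silent frames seen so far
    for idx, is_speech in enumerate(voiced):
        if is_speech:
            if start is None:
                start = idx
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause:
                end = idx - silent_run + 1  # first frame of the pause
                segments.append((start * frame / rate, end * frame / rate))
                start = None
                silent_run = 0
    if start is not None:
        end = len(voiced) - silent_run      # drop trailing silence
        segments.append((start * frame / rate, end * frame / rate))
    return segments
```

On a synthetic signal with a 0.4 second pause and a 0.1 second pause, only the longer pause produces a boundary. Real recordings would also need the other prosodic cues cited above, such as pitch variation, to separate phrase boundaries from hesitations.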

IV. Related works

Some projects share our field of interest. Some of them offer hardware dedicated to reading electronic books. This is the case of Magnum [4] and PlexTalk [5, 6]. Others are software running on ordinary computers, such as DAISY [7] and DIGIBOOK [8]. The distinctive characteristic of the Talking Reader System is that it is based only on speech recordings, whereas DAISY and DIGIBOOK use both text and sound representations.

REFERENCES

[1] Expert Working Group. User requirements for next generation of talking books. http://www.rnib.org.uk/wedo/research/talkbook/semin6.htm, 1996. RNIB website.

[2] A. TISNE. De l'etude a la realisation de mecanismes de consultation de documents sonores : Talking Reader. Postgraduate diploma taken before completing a PhD, Paul Sabatier University, Toulouse III, June 1997.

[3] W. Gaver. Synthesizing auditory icons. In Interchi'93, pages 228-235, Amsterdam, April 1993.

[4] Visuaide. website. http://www.cam.org/ visuaide/index.html.

[5] Plextor. website. http://www.plextor.com/.

[6] Plextor. PLEXTALK Operation Manual, December 1996.

[7] Labyrinten. website. http://www.labyrinten.se/daisyuk.html.

[8] DIGIBOOK. website. http://www.rnib.org.uk/wedo/research/talkbook/semin14.htm.