Web Posted on: February 25, 1998


ENRICHED LANGUAGE MODELS FOR FLEXIBLE GENERATION IN AAC SYSTEMS

Ann Copestake
CSLI, Stanford University
aac@csli.stanford.edu
Dan Flickinger
CSLI, Stanford University
dan@csli.stanford.edu

In this paper, we discuss how research on language as a form of joint action yields insights into the types of natural language processing (NLP) techniques that can be used in AAC based on text-to-speech devices. Our work is specifically directed at the needs of people who have lost the use of speech through motor impairment such as ALS (Lou Gehrig's disease). Most work on NLP techniques for such systems has concentrated on word and phrase prediction or on facilitating retrieval of complete preconstructed messages. A prototype system developed at CSLI, which incorporates prediction and message retrieval, has been in daily use by a person with ALS for several years. The system runs on a standard laptop PC. For our user, this approach has considerable advantages over dedicated AAC hardware, since it allows the same computer to be used for email, Web access and so on. Investigation of data logged from this system, and of a few hours of audio-taped conversations involving the user, has suggested that a wider range of NLP techniques might be utilized. These are currently being incorporated into a research prototype that will extend the functionality of the existing system.

Clark (1996:p9) argues that face-to-face conversation is the most basic form of language use and gives ten features that characterize it (as opposed to written communication etc). Of these features, the first four (copresence, visibility, audibility and instantaneity) concern the immediacy of face-to-face conversation. Immediacy is reduced for AAC conversations particularly because of the effective lack of instantaneity (defined as the ability to perceive others' actions without perceptible delay). Although the other participants in the conversation may be able to tell that the AAC user is entering data, there is a delay in their perception of this information as speech. There may be a problem with visibility, since an AAC user may not be able to input text while looking at the other conversation participants. Audibility may also be affected - an AAC user may not be able to pay attention to what someone else is saying while constructing a message and, conversely, synthesized speech is generally less intelligible than natural speech. Furthermore, it may not be loud enough to be easily heard over other people's voices.

The second set of features (evanescence, recordlessness and simultaneity) concerns the medium. While speech is evanescent, AAC devices can record the input text - as we will discuss below, there are circumstances in which this may be useful. The lack of simultaneity (defined as participants being able to produce and receive at once and simultaneously) is, however, a problem, which is again caused by the difficulty of text entry, compared with speaking, and the relative inaudibility of synthesized speech. Finally, Clark lists the features of extemporaneity, self-determination and self-expression, which concern control of the conversation. These are compromised for the AAC-user, because the timing of the output of a text-to-speech device is not under precise control, and because, with most current AAC systems, it is difficult or impossible to change intonation, prosody etc. Thus the user cannot effectively control anything other than the actual words used in speech. In fact even this ability is reduced if prestored text has to be used to make the conversation sufficiently immediate.

Communication using an AAC device has much in common with written media. Often an AAC user is able to use written forms of communication such as electronic mail without appreciable handicap. For people whose disability is physical rather than linguistic or cognitive, the fundamental deficiency in AAC is not that communication as such is impaired but that the expectations of face-to-face conversation are violated, in particular with respect to timing. In fact, (electronic) written media have some advantages over spoken media which are potentially shared by AAC: specifically the possibility of recording and indexing by content and the ability to reuse material, both from previous communications and from external sources. Of course, face-to-face AAC does have more immediacy than normal written communication - the AAC user can use non-verbal signals (although the ability to do this may be greatly reduced in ALS) and the other participants in the conversation will be speaking normally. Thus our goal is to maximize the immediacy of conversations using an AAC device while still allowing the AAC-user self-determination. In what follows we will briefly discuss three complementary aspects of using NLP in AAC devices: speeding up AAC input by making use of language conventions, maximizing utility of stored material and managing conversations to reduce the effect of delays.

Certain types of utterance occur very frequently and are conventional to some extent, such as greetings and partings, requests, indications of understanding or difficulty in understanding and so on. Many AAC systems allow for retrieval of fixed text, such as "Please could you" to preface a request. We aim to improve on this by developing a series of flexible `templates' for use in an AAC system. Each template has a series of slots and is associated with a range of fixed phrases. The `request' template, for instance, has a slot for the requested action and also optional slots for requestee, time of action and so on. The user selects a template and fills in the obligatory slots and any desired optional ones. Word prediction operates on slot filling, and is enhanced because the template gives more context than free text - some closed-class words (determiners, prepositions etc) can also be added automatically. The associated fixed text is varied according to the desired level of formality: this can be done automatically, if the addressee is known. Parameters such as urgency can also be used to vary the fixed text. For example, the AAC-user could instantiate the requested action slot in the request template with "get glasses" and the system could generate "Please could you get me my glasses" at a default level of formality, or "Would you get my glasses, dear?" if the request is addressed to a spouse, and so on. The role of NLP in this process is to predict the closed class words and otherwise ensure that the output is grammatical, as well as doing the contextual prediction. This is related to work on compansion (Demasco and McCoy, 1992), but our eventual aim is to be able to accomplish this for an arbitrarily large vocabulary. Parameters such as urgency also require variation in speech intonation and volume. All this will take considerable research in NLP - our approach is described in Copestake (1997).
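As a rough illustration of the `request' template, the following Python sketch fills the requested-action slot and picks the fixed text according to the addressee. It is a toy stand-in, not the approach of Copestake (1997): the names `FRAMES` and `Request` are hypothetical, the closed-class words ("me", "my") are simply hard-coded in the fixed phrases rather than predicted by a grammar, and only the two wordings quoted above are covered.

```python
from dataclasses import dataclass

# Hypothetical fixed phrases for the `request' template, keyed by addressee.
# The closed-class words are written into the frames here; a real system
# would have to predict them and check grammaticality.
FRAMES = {
    "default": "Please could you {verb} me my {noun}",
    "spouse": "Would you {verb} my {noun}, dear?",
}


@dataclass
class Request:
    action: str                  # obligatory slot, e.g. "get glasses"
    addressee: str = "default"   # optional slot; selects the level of formality

    def realize(self) -> str:
        """Expand the telegraphic slot filler into a full request."""
        verb, _, noun = self.action.partition(" ")
        if not verb or not noun:
            raise ValueError("the requested-action slot needs a verb and an object")
        # Unknown addressees fall back to the default level of formality.
        frame = FRAMES.get(self.addressee, FRAMES["default"])
        return frame.format(verb=verb, noun=noun)


print(Request("get glasses").realize())
print(Request("get glasses", addressee="spouse").realize())
```

Keeping the fixed text separate from the slot fillers is what would let parameters such as formality or urgency change the wording without the user retyping the content words.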

Another aspect of convention in conversation concerns institutional settings where the discourse participants normally stick to a relatively circumscribed dialogue (Clark, 1996). Examples are checking out goods in a supermarket, hiring a car and commenting on the play in a soccer game. The term `script' is sometimes used (Schank and Abelson, 1977), and indeed in the first two examples cited, the employees of the business may literally be using a script imposed by the management. However, `script' is rather too narrow a term for our purposes, since it assumes a relatively fixed linear sequence. The ordering is of less importance for the AAC user than is having a good system for organizing likely utterances. An analogy that might be helpful is the foreign language phrase-book, which is typically organized into sections such as `at the hotel', `in the train station' and so on. Within these sections, there may be a probable ordering of questions and responses, for checking in, buying a ticket and so on, but there are many possible situations that do not fit into a neat sequence. Furthermore there are degrees of conventionality involved in institutional settings.

To make this more concrete, consider the `playing bridge' setting. Bidding a bridge hand is highly conventionalized: the bidding sequences are limited, and the terms used are restricted. For example, the bid "a duo of quadrilaterals" might convey the same information as "two diamonds", but is unconventional, and in fact, prohibited according to the rules. However, there is also a less regimented sort of dialogue, after the hand has been played, when the score is calculated, and when comments are made. For example: "why didn't you double six no trumps when you had the ace and king of hearts?". An AAC device can be programmed with templates for such semi-conventional utterances in the manner discussed above, and because the domain is restricted, prediction will operate very effectively to help the user to fill the template slots. However, the range of possible settings that someone might be involved in is very large. It is not feasible to supply predefined template sets for all of them, so the construction of templates must be simple enough to be done by the user or by a helper.
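To see why a restricted domain helps prediction, consider the following minimal sketch of prefix-based word prediction over a frequency list. Everything in it is an assumption made for illustration: the function `predict`, the two vocabularies and their counts are invented, and a real predictor would use far richer context than a single prefix.

```python
from collections import Counter


def predict(prefix: str, vocabulary: Counter, k: int = 3) -> list:
    """Return up to k words starting with prefix, most frequent first."""
    prefix = prefix.lower()
    matches = [(word, count) for word, count in vocabulary.items()
               if word.startswith(prefix)]
    # Sort by descending count, breaking ties alphabetically.
    matches.sort(key=lambda wc: (-wc[1], wc[0]))
    return [word for word, _ in matches[:k]]


# Invented counts: a small `playing bridge' vocabulary versus general text.
BRIDGE = Counter({"diamonds": 12, "hearts": 10, "double": 9,
                  "spades": 8, "clubs": 7, "trumps": 6})
GENERAL = Counter({"do": 90, "day": 50, "did": 40, "dinner": 20,
                   "double": 2, "diamonds": 1})

print(predict("d", BRIDGE))   # bridge terms come first in the bridge setting
print(predict("d", GENERAL))  # the same keystroke is much less helpful here
```

With the setting-specific list, a single keystroke already offers "diamonds" and "double"; with the general list the same keystroke offers only common words, so the user would have to type further.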

Institutional settings are cases where we can make use of prior knowledge about recurrent situations. There are other times when an AAC-user may be able to partially preplan a particular conversation, because there is a specific, prior goal. For example, when going to a coworker's office to request that they fill in a form, or phoning a friend to ask whether they want to go to the movies, the conversation is initiated for a particular purpose. Even non-AAC users may preplan or rehearse conversations to some extent, especially if the topic is particularly sensitive (e.g., asking the boss for a pay rise) or technical (e.g., calling technical support), or when the conversation is not in their native language.

These cases involve our second strategy for NLP, which is to enhance the utilization of existing material, since it can be much quicker to retrieve text rather than to generate it. It is usually possible for an AAC-user to preconstruct the central utterances they will need to say in a conversation with such a specific goal. The use of templates combined with prediction can aid construction of text in advance of a conversation as well as during it. There are also conventional presequences (such as "Oh there's something I want to ask you") and post-sequences (like "That's great"). Preconstructing text for such situations is not so different from what a speaking person does when rehearsing a conversation, especially in cases where the subject matter is involved enough to require written notes. Even in cases where the words are not rehearsed, it is common to write a reminder to oneself, like "ask Kim about movies". So preconstructing text in an AAC system can be a natural extension of this, provided the system supports flexible storage and retrieval of text. In fact, an AAC system can in some ways enhance the user's abilities compared to a non-AAC user, since the system can display a `things to say list' (like a calendar program). It can even prompt the user to remember something when they are unexpectedly in contact with a specific person for whom they had a preplanned request (on the assumption that the interlocutor is identified to the system).
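A `things to say list' of this kind could be sketched as follows. This is only an illustration under stated assumptions: the class `ThingsToSay` and its methods are hypothetical names, and the sketch assumes that the interlocutor has already been identified to the system by some other means.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ThingsToSay:
    """Preconstructed utterances, stored per intended addressee."""
    items: dict[str, list[str]] = field(default_factory=dict)

    def add(self, person: str, utterance: str) -> None:
        self.items.setdefault(person, []).append(utterance)

    def prompt_for(self, person: str) -> list[str]:
        """Utterances still pending for a person who has just been identified."""
        return list(self.items.get(person, []))

    def mark_said(self, person: str, utterance: str) -> None:
        """Remove an utterance once it has been spoken."""
        pending = self.items.get(person, [])
        if utterance in pending:
            pending.remove(utterance)


todo = ThingsToSay()
todo.add("Kim", "Do you want to go to the movies on Friday?")
print(todo.prompt_for("Kim"))    # shown when Kim is identified as interlocutor
print(todo.prompt_for("Sandy"))  # nothing is pending for Sandy
```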

Work at the University of Dundee (e.g., Alm et al, 1992) has described the advantages of using preconstructed text to improve the naturalness of conversations generally. However, in cases of conversations where there is no predetermined goal, we think it is interesting to consider the alternative of reusing existing text on-the-fly rather than trying to construct text in advance. As we mentioned above, written text has the advantage that it can be recorded, indexed and later edited. When sending email, for example, some of the text from a previous message can be pasted into a new one. This also applies to AAC: conversations can be indexed by time and participants' names, as well as by content, and then recalled and reused.
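The kind of indexing meant here might look like the sketch below, which stores each utterance with its time and participants and retrieves by any combination of content, participant and date. The class and method names are hypothetical, and plain case-insensitive substring matching stands in for whatever content indexing a real system would use.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Utterance:
    when: datetime
    participants: frozenset
    text: str


class ConversationLog:
    def __init__(self) -> None:
        self._entries: List[Utterance] = []

    def record(self, when: datetime, participants, text: str) -> None:
        self._entries.append(Utterance(when, frozenset(participants), text))

    def search(self, keyword: Optional[str] = None,
               participant: Optional[str] = None,
               since: Optional[datetime] = None) -> List[str]:
        """Return stored texts matching every criterion that was given."""
        hits = []
        for entry in self._entries:
            if keyword is not None and keyword.lower() not in entry.text.lower():
                continue
            if participant is not None and participant not in entry.participants:
                continue
            if since is not None and entry.when < since:
                continue
            hits.append(entry.text)
        return hits


log = ConversationLog()
log.record(datetime(1998, 2, 3, 10, 0), {"Kim"}, "The new proposal is due on Friday.")
log.record(datetime(1998, 2, 4, 9, 30), {"Sandy"}, "I read the proposal last night.")
print(log.search(keyword="proposal", participant="Kim"))
```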

This concept of text reuse applies to material from outside sources too. It is very rare for conversations to be simply about the participants and their current environment. Much more frequently, people will describe incidents and previous conversations, discuss topics in the news, talk about some interesting piece of information they have discovered, or something they have seen on television, and so on, often repeating things they have heard, more or less precisely. In a work environment, formal discussions are often centered around topics that may have been raised at previous meetings, written proposals, reports, spreadsheets and so on. The user of our existing AAC system spends a lot of time searching the World Wide Web and often shows people Web sites in the course of conversations. We believe that it should be possible to extend this, so that the preexisting text can be incorporated into an AAC user's utterances, commented on and so on.

For example, consider the following scenario. Someone says "Isn't the flooding terrible? It took me over an hour to get to work". The AAC-user has read several things about the effect of the rain on the roads that morning, so searches on `flood' and accesses two relevant Web pages and an email message. Appropriate text about the state of the roads can then be selected and read out. In some ways, the AAC-user has an advantage: they do not have to know about the details of the text in advance, so they can talk about something they do not necessarily remember. If necessary, the preexisting text can be edited and added to - the amended text can itself be stored and reused if another person mentions the same topic. Obviously it would be tedious if an entire long newspaper article was read out, but NLP technology could help here too, since it is possible to produce summaries of text automatically.
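A crude version of the `flood' search can be sketched as follows: given some stored sources, it returns only the sentences that mention the search term, so that the user can pick one to read out. This is keyword-based sentence selection, a much simpler technique than automatic summarization; the function name, the source labels and the sample texts are all made up for illustration.

```python
import re
from typing import Dict, List, Tuple


def relevant_sentences(term: str, sources: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (source label, sentence) pairs for sentences mentioning term."""
    hits = []
    for label, text in sources.items():
        # Naive sentence splitting: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if term.lower() in sentence.lower():
                hits.append((label, sentence))
    return hits


# Invented sample material standing in for two Web pages and an email message.
sources = {
    "web page": "Heavy rain fell overnight. Flooding has closed Route 9 near the river.",
    "email": "The meeting is at noon. Allow extra time because of the flood on Main Street.",
}
for label, sentence in relevant_sentences("flood", sources):
    print(f"[{label}] {sentence}")
```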

In all cases where prepared material is used, there may be interruptions because the other conversation participants cannot hear or understand something. In this case, it is necessary for the AAC device to support repair. Repairs as joint actions are discussed by Clark (1996): the problem for the AAC-user is that timing is critical in repair sequences. We think that NLP techniques may eventually be able to help, but will not discuss this further here.

Another set of issues, which arises even if prepared text is used, concerns turn-taking in conversation. The AAC user has to be able to get a chance to speak and then keep their turn, despite the delays in producing an utterance. The most serious problems in getting a chance to speak arise when an AAC-user is involved in conversations with two or more other participants. It is generally true that in a multi-person conversation one or two people will dominate (or there will be a split into multiple conversations). Speakers who are slower for whatever reason are likely to be left out. For AAC-users the problem is compounded because they are unlikely to be able to get the timing right to say something at a natural turn-taking point, and if they do interrupt a speaker, the volume on the speech synthesizer may not be sufficient for the interruption to be audible. We are investigating NLP techniques to help users manage their conversations more effectively. These involve incorporating very basic speech recognition into an AAC device, with the aim of being able to adjust timing and volume automatically to allow more effective interruption.

There are conventional utterances in normal conversation which indicate that the speaker wishes to say something, prior to actually starting or completing the utterance proper. The main ones are "uh" or "um": "uh" is used for a minor break and "um" for a longer one - possibly of many seconds (Clark, 1996). There are problems in utilizing this sort of signal for AAC: the delay before speech may be much more protracted than the conventional signals usually indicate, automatically generating them appropriately is not easy, and AAC users may not like producing `non-words'. An alternative is to use an explicit hand-signal - a strategy that is also used by aphasics (Lesser and Milroy, 1993). However, hand-signals, and other changes in posture, have the disadvantage that they limit typing speed on a conventional keyboard, and may not be possible for people who have lost speech through stroke or ALS. One option we are experimenting with is explicit requests to wait, such as "Just a minute", automatically generated when the user starts doing something which will lead to a delayed response (e.g., text retrieval or template construction). Another is to use prearranged tones or lights, which can indicate either "Be quiet while I finish this" or "I'm starting to input something, but it's OK to keep talking meanwhile" (the latter acts as an indication that this topic is one the AAC-user wishes to return to).
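The two signalling options just described could be triggered along the following lines. This is a sketch only: the action names, the `Signal` enumeration and the function `signal_for` are hypothetical, and how the user chooses between a spoken request and a tone or light is left as a simple flag.

```python
from enum import Enum
from typing import Optional


class Signal(Enum):
    SPOKEN_WAIT = "Just a minute"
    TONE_HOLD = "Be quiet while I finish this"
    TONE_CONTINUE = "I'm starting to input something, but it's OK to keep talking meanwhile"


# Assumed set of user actions that lead to a delayed response.
SLOW_ACTIONS = {"text_retrieval", "template_construction"}


def signal_for(action: str, use_tone: bool = False,
               others_may_continue: bool = False) -> Optional[Signal]:
    """Choose a turn-holding signal when the user starts a slow action."""
    if action not in SLOW_ACTIONS:
        return None  # quick actions need no signal
    if not use_tone:
        return Signal.SPOKEN_WAIT
    return Signal.TONE_CONTINUE if others_may_continue else Signal.TONE_HOLD


print(signal_for("text_retrieval"))
print(signal_for("template_construction", use_tone=True, others_may_continue=True))
```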

To conclude, because of the inherent differences in media discussed at the start of this paper, it is unrealistic to expect that conversation with an AAC user can be the same as face-to-face conversation between non-AAC-users, however successful work on incorporating NLP techniques eventually is. However, communication using an AAC device can be made more effective through speeding up user input by taking advantage of conventions, and also by reusing text. Furthermore, AAC devices can benefit from the additional functionality of PCs, particularly in serving as a memory aid, and in allowing the retrieval of text from a range of sources. We hope that further work on turn-indication techniques and timing of the AAC-user's contribution will lead to improvements in integration of action in conversation, especially when talking to people who are relatively unfamiliar with AAC.


Acknowledgments

This material is based upon work supported by the National Science Foundation under grant number IRI-9612682.

References

Alm, N., J.L. Arnott and A.F. Newell (1992), "Prediction and conversational momentum in an augmentative communication system", Communications of the ACM, vol. 35(5), 47-57

Clark, H. (1996), "Using Language", Cambridge University Press, Cambridge, UK

Copestake, A. (1997), "Augmented and alternative NLP techniques for augmentative and alternative communication", Proceedings of the ACL workshop on Natural Language Processing for Communication Aids, Madrid, Spain

Demasco, P.W. and K.F. McCoy (1992), "Generating text from compressed input: an intelligent interface for people with severe motor impediments", Communications of the ACM, vol. 35(5), 68-78

Lesser, R. and L. Milroy (1993), "Linguistics and Aphasia", Longman, London

Schank, R.C. and R.P. Abelson (1977), "Scripts, plans, goals and understanding: an inquiry into human knowledge structures", Lawrence Erlbaum Associates, Hillsdale, NJ