Sunday, August 27, 2006

Exploiting Search, Speech Recognition and TTS


There have been developments over the last 10 years that throw new light upon the challenge of communicating with a machine. Consider the great success of search engines in digging out the relevant out of millions of pages of text. This has made it possible for most of us to operate with a much higher degree of timely and relevant information than we would have had otherwise. This suggests that given adequate content in text form, search techniques could dig up and give us useful information in a user-friendly form, and there are lots of people who can benefit from such an interface.

Let us define a new test emulating the Turing Test, one called the Useful Talking Machine Test (UTMT). It will require a machine to respond to typed-in or spoken input in interesting and appropriate ways with spoken responses to help and impress listeners, on fairly broad topics defined by the test creator. It is deliberately a broad definition. Getting software/machines to win this test may be much easier than winning the Turing Test. Why?

The Turing Test leaves the questioner free to turn the conversation to anything and everything. The domain of discourse and the expected performance criteria are undefined, except to say that if a human can cope with the expectation, so should a machine. The UTMT, on the other hand, allows you several major comforts:

a) you can define the topic(s) on which the machine would take the test
b) it could use a large body of content (including all that is on the Web on this topic) to provide useful information to the person testing the system; most of the workers involved with the Turing test forget the value of large bodies of content.
c) The system could ride on the success of the search engines, searching through the stored body of text for paragraphs in which the search words appear.
d) School textbooks cover the common minimum knowledge that the average person at a given educational level has had access to. Hence, using the textbooks of ten school years could give us a certain level of base knowledge, which can be augmented by including selected books written at the right level and style.
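The searching-through-stored-text idea in (c) can be sketched in a few lines. This is a minimal illustration, not a real search engine: it splits a plain-text knowledge-base into paragraphs and ranks them by how many of the search words each contains. The corpus and query here are invented examples.

```python
# Minimal sketch: rank paragraphs of a plain-text corpus by the
# number of distinct query words each paragraph contains.
import re

def find_paragraphs(text, query_words):
    """Return paragraphs ranked by how many query words they contain."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    scored = []
    for p in paragraphs:
        words = set(w.lower() for w in re.findall(r"\w+", p))
        score = sum(1 for q in query_words if q.lower() in words)
        if score > 0:
            scored.append((score, p))
    scored.sort(key=lambda s: -s[0])
    return [p for _, p in scored]

corpus = """Lincoln led the US through its Civil War.

He is widely considered a great president.

The telephone was invented in 1876."""

hits = find_paragraphs(corpus, ["Lincoln", "great", "president"])
```

A production system would of course use an inverted index rather than scanning every paragraph, but the retrieval unit (the paragraph) is the point being made here.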

However, the UTMT is not without its own challenges:

a) a search engine spews out findings indiscriminately, and expects the user to select the few relevant texts/visuals/recordings from the rest
b) It does not specifically attempt to answer the user’s question by combining or rephrasing information discovered by search.

Consider asking why Lincoln was a great President of the US. I tried a search with the words {Lincoln, great president}. I looked into the first hit I got, searching for “great”, and found this:

While the war raged, Lincoln also suffered great personal anguish over the death of his beloved son and the depressed mental condition of his wife.
(See http://www.americanpresident.org/history/abrahamlincoln/ )

So, let us define an experimental challenge:

Given multiple hits by a search engine working in response to the words in a natural language question, how can we synthesize a suitable reply to be given to the questioner as a set of selected, and connected sentences (or utterances)?

There are secondary issues, such as how we filter, group and sequence the words in the question to ensure getting a meaningful reply. By grouping I mean using notations such as “great president”, or specifying that “Lincoln” and “great president” should occur in the same sentence. Filtering would eliminate the all-too-common words from the question while creating the search input. Such common words would bring up irrelevant hits from the search engine.
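The filtering and grouping steps can be sketched as follows. This is a toy illustration: the stopword list is a small hand-picked sample, and quoted spans in the question are treated as phrase terms that must stay together.

```python
# Sketch of query construction: quoted spans become phrase terms,
# and all-too-common words are filtered out before searching.
import re

# Illustrative sample; a real system would use a fuller stopword list.
STOPWORDS = {"why", "was", "a", "of", "the", "is", "are", "what", "how"}

def build_query(question):
    """Turn a natural-language question into a list of search terms.
    Quoted spans are kept whole as phrases; common words are dropped."""
    phrases = re.findall(r'"([^"]+)"', question)
    rest = re.sub(r'"[^"]+"', " ", question)
    words = [w for w in re.findall(r"\w+", rest)
             if w.lower() not in STOPWORDS]
    return phrases + words

terms = build_query('why was Lincoln a "great president" of the US')
```

Given the Lincoln question above, this yields the phrase “great president” plus the content words “Lincoln” and “US”, which is essentially the search input used in the experiment described earlier.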

It would be useful to create a powerful context in which a system like this can be designed and in which the performance of the system can be judged. Let us examine two such contexts. They define frameworks in which the utility of the system proposed is emphasized, taking the focus away from comparisons with human performance. We started with the Turing Test, but have now come to define a significantly different test.

Context 1: An Assistant for Visually Handicapped Students

Here the computer would simulate a human assistant to visually handicapped students. The student would speak (or type in) a few words to indicate any need for information, and the assistant would “speak out” relevant information through a TTS system. Imagine that fifty textbooks are typed in (or read-in, if available in machine readable form) into the machine to provide the knowledge-base in the form of plain text. The answering technique would use a search engine to choose relevant paragraphs and then have them read-out using the TTS, possibly after appropriate transformation through an automatic “editing program”. The editing program could transform the sentences found by the search, to fit them to the question asked. It could eliminate unnecessary materials and duplications. It could also try to connect sentences using primarily syntactic transformations.

My own hunch is that selecting relevant sentences, ordering them in the right sequence, and not attempting to modify the sentences themselves would be the best approach. Wherever possible, the best solution would be to select a whole paragraph or a suitable part of such a paragraph and have it spoken out.
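The select-and-order hunch can be made concrete with a small sketch: pick out sentences that share words with the query, but keep them in their original source order instead of rewriting them. The sentence splitter here is a deliberate simplification, and the example inputs are invented.

```python
# Sketch of extractive answering: select whole sentences containing
# query words, preserving their original order, with no rewriting.
import re

def select_sentences(paragraphs, query_words, limit=3):
    """Select up to `limit` relevant sentences, preserving source order."""
    q = set(w.lower() for w in query_words)
    chosen = []
    for para in paragraphs:
        # Naive sentence split on terminal punctuation followed by space.
        for sent in re.split(r"(?<=[.!?])\s+", para):
            words = set(w.lower() for w in re.findall(r"\w+", sent))
            if words & q:
                chosen.append(sent.strip())
    return chosen[:limit]

answer = select_sentences(
    ["Lincoln led the Union. He preserved it.",
     "Many call him a great president."],
    ["Lincoln", "great", "president"])
```

Because the sentences are emitted in the order they appear in the source, the spoken answer inherits whatever coherence the original author gave the text, which is exactly the point of not attempting to modify the sentences.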

The text form is not the only one suitable for the knowledge-base. Someone might wish to consider representing the content of the books in some form of knowledge representation superior to that of plain text. This would, of course, necessitate the implementation of a search technique appropriate to the representation used. It would also raise the threshold of effort required to create the system, having to recast printed text into a suitable knowledge-representation. It would also necessitate a more elaborate exercise in producing comprehensible speech output.

The simplest assistant would merely select suitable hits from the search engine’s output, and read them out using the TTS. It could provide for some manual control, such as skipping what is being read out when the user feels it is irrelevant, or having the read-out repeated when the user requests it. This approach would be useful if our focus is mostly utilitarian. On the other hand, the issue of synthesizing a suitable response from the raw output of the search engine is an important research challenge.
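The skip/repeat control loop of this simplest assistant can be sketched as below. The speak() function is a stand-in: a real assistant would call a TTS engine there, while this sketch just records what would be spoken. The command sequence stands in for live user input.

```python
# Sketch of the read-out loop with manual control. speak() is a
# placeholder for a real TTS engine call; it records utterances.
spoken = []

def speak(text):
    """Placeholder for a TTS engine call."""
    spoken.append(text)

def read_out(hits, commands):
    """Read hits aloud. 'repeat' re-reads the current hit;
    'skip' and 'next' both advance to the following hit.
    `commands` stands in for live user input."""
    i = 0
    cmds = iter(commands)
    while i < len(hits):
        speak(hits[i])
        cmd = next(cmds, "next")
        if cmd == "repeat":
            continue      # re-read the same hit
        i += 1            # 'skip' and 'next' both advance

read_out(["First hit.", "Second hit."], ["repeat", "next", "next"])
```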

Context 2: A Patient Information System

This would be very similar to the Handicapped Student’s Assistant, but would work in the context of a medical patient’s helper. The system concerned could possibly have touch-screen input and stand in the lobby of a hospital. A patient, who is seeking information (on his/her way out, after meeting the doctor), could type in a question and have the machine speak out relevant information. For example, let us visualize some questions:

a) What are possible side effects of the medicine prescribed for me? (This assumes that the patient’s prescription information is made available to the system in some manner, without compromising his/her privacy.)
b) Can the regular use of calcium tablets cause bladder stones?
c) What causes TB?
d) What are the risks of having typhoid?

There are many places in the world where doctors do not have the time to deal with such questions from patients. That is where the helper would be of value.


Various other “skins”

There are various ways to clothe the basic scheme. One could make it look like an educational tool, offering a whole lot of useful information to a student at a school or in an adult literacy class. Examples: how to save money through a bank deposit, how to raise a bank loan, what the dangers of pesticides are, how to travel to a given destination, what happened at a given sports event, what a given person is famous for, etc., all restricted to the information contained in a set of books used in teaching at the school.

Steps Forward in this Project

The first step in addressing this problem would be to do a search for any earlier attempts to solve similar or related problems.

Discussion of this problem with others interested would have its advantages and disadvantages. The advantages include getting a wider perspective from linguists, researchers in computer science areas such as human-machine interaction and artificial intelligence, search specialists, school teachers, and teachers of the handicapped. The disadvantage is that it is always easier to think of many reasons to argue that something cannot be done! Canned wisdom cannot solve new problems, and often a little experimentation goes a long way.

The most important steps would be to critically assess what was tried and what was achieved, and to write about these.

A search for “natural language” “search engine output” produces several interesting hits. A particularly readable history of search engines is in
http://www.wiley.com/legacy/compbooks/sonnenreich/history.html


Srinivasan Ramani

1 comment:

Usha Ramani said...

A Wiki for visually handicapped people?

Can we create a Wiki in which we use verbal alternatives for illustrations, photographs, indentations, paragraph numbers etc.? Such a Wiki can be designed specially for use by the visually handicapped, with special focus on student needs. We can also have a moderation and classification scheme, so that users can use tools that pick up what is most appropriate to them. One such scheme would use a number, merely to indicate the number of years of schooling expected of the reader, for her to understand the posting. We can also allow downloading and local hosting of content in e-books etc.

Such a Wiki would vastly enhance the value of the Internet to the handicapped.

Usha Ramani