Voice (VoiceXML)

part of Software Engineering for Internet Applications by Eve Andersson, Philip Greenspun, and Andrew Grumet; revised February 2005
In every computing era, programmers have been responsible for writing the fundamental application logic. During the desktop application era (1980s), the attention given to this logic was generally dwarfed by that given to the user interface, event handling, and graphics code that a programming team needed to write to get a computer program into the hands of users. Result: very little innovation at the individual level; most widely used computer programs were written by large companies.

During the Web era (1990s), the user interface and graphics were rendered by the Web browser, e.g., Netscape Navigator or Microsoft Internet Explorer. Programmers were able to deliver a complete system to end-users after writing only the application logic and some simple HTML specifying the user interface behavior. Result: a revolution in innovation, with most Web applications written in a few months by a handful of people.

Suppose that you'd observed that telephones are much more common and portable than personal computers and Web browsers. Furthermore, you'd noticed that telephones can be used by almost everyone, whereas many consumers have little patience for the complexities of the PC. Thus, you'd want to make your information system accessible to a user with only a telephone. How would you have done it? In the 1980s, you'd have rented a telephone line, bought a big specialized box to recognize utterances, bought another specialized box to talk to the user, and parked those boxes right next to the main server for your application. In the 1990s, you'd have rented a telephone line, bought specialized software, and parked a standard computer running that software next to the server running your application. Result in both decades: very little innovation, with only the largest organizations offering voice/telephone interfaces to their information systems.

With the advent of today's voice browsers, the coming years promise to be a period of tremendous innovation in the development of telephone-accessible Internet applications. With a Web application, you operate the HTTP server and run the application code; someone else runs the browser. The idea of the voice browser is the same. You operate a server and the application. Someone else, perhaps the phone company, runs the telephone lines and voice browser.

Bottom line: voice browsers allow you to build telephone voice applications with nothing more than an HTTP server. From this, great innovation shall spring.


Suppose Tracy, a vice president at a Boston-based firm, has just flown into Los Angeles. She wants to know the telephone number and address of her company's Los Angeles office, as well as the direct number for one of the employees. Since her company intranet is not telephone-accessible, she has to call up her assistant and ask him to open up a Web browser to look up the information in the intranet.

With VoiceXML, it can take as little as a few hours for a developer to take virtually any information available on the Web and make it available by telephone — not just to callers with high-tech cellphones, but to anyone with any kind of telephone. Tracy would be able to dial a number and say which office or employee she is looking for. After searching through some of the intranet's database tables, the VoiceXML application can read aloud the phone numbers and addresses she wants. And next time Tracy arrives confused in a foreign city, she won't have to rely on her assistant being at his desk.

What is VoiceXML?

VoiceXML, or VXML, is a markup language like HTML. The difference: HTML is rendered by your Web browser to format content and user-input forms; VoiceXML is rendered by a voice browser. Your application can speak to the user via synthesized speech or by prerecorded audio files. Your software can receive input from the user via speech or by the tones from their telephone keypad. If you've ever built a Web application, you're ready to get started with your phone application.

How to make your content telephone-accessible

As in the old days, you can still rent a telephone line and run commercial voice recognition software and text-to-speech (TTS) conversion software. However, the most interesting aspect of the VoiceXML revolution is that you need not actually do so. There are free VoiceXML gateways, such as Tellme (http://www.tellme.com), BeVocal (http://www.bevocal.com), and VoiceGenie (http://www.voicegenie.com). These take VoiceXML pages from your Web server and read them to your user. If your application needs input from the user, the gateway will interpret the incoming response and pass that response to your server in a way that your software can understand.

Figure 10.1: HTML: Publisher owns the HTTP server, which uses HTML to specify a user experience that is rendered on the reader's desktop computer. VoiceXML: Publisher owns the HTTP server, which uses VoiceXML to specify a user experience that is rendered on a 3rd-party gateway system and delivered as audio to the user's telephone.

You use a Web form to configure the gateway with the URL of your application, and it will associate a telephone number with it. In the case of Tellme, your users call 1-800-555-TELL, dial your 5-digit extension, and now they're talking to your application.

Exercise 1

Use Tellme (1-800-555-TELL) to:

Record the amount of time required to complete the first three tasks.

Exercise 2

Come up with a list of two or three services from your learning community that will be valuable to telephone users. You may find the following guidelines useful:

VoiceXML Basics

The format of a VoiceXML document is simple. Here's how to say "Hello, World" to your visitors:

<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <block><audio>Hello, World</audio></block>
  </form>
</vxml>

The first tag, <?xml version="1.0"?>, specifies that the document to follow conforms to the XML 1.0 standard. All VoiceXML documents follow this standard.

As in any XML document, every opening tag (e.g., <vxml>) has to be closed, either with a closing tag like </vxml>, or with a slash (/) at the end of the tag, as in the <else/> tag in the next example. The other important rule to remember is that all attribute values must be enclosed in quotation marks, as in version="2.0". XML is much stricter than HTML in these two regards.

The <vxml version="2.0"> tag specifies that this is a VoiceXML 2.0 document. Within that is a <form>, which can either be an interactive element — requesting input from the user — or informational. You can have as many forms as you want within a VoiceXML document. A <block> is a container for your executables, meaning that all your tags that make your application do something, such as <audio>, <goto>, and a variety of others, can be clumped together inside of a block. <audio>text</audio> will read the text with a TTS converter, whereas <audio src="wav_file_URL"/> will play a pre-recorded .wav audio file.
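Because the gateway fetches VoiceXML over plain HTTP, any Web server can deliver a document like this one. The following is only a minimal sketch using Python's standard http.server module. The class name, the serve function, the port, and the text/xml content type are our own assumptions; check your gateway's documentation for the content type it expects.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# A static "Hello, World" document using the form/block/audio structure
# described in the text.
HELLO_WORLD_VXML = """<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <block>
      <audio>Hello, World</audio>
    </block>
  </form>
</vxml>
"""


class VoiceXMLHandler(BaseHTTPRequestHandler):
    """Answers every GET request with the static document above."""

    def do_GET(self):
        body = HELLO_WORLD_VXML.encode("utf-8")
        self.send_response(200)
        # Assumption: the gateway accepts text/xml.
        self.send_header("Content-Type", "text/xml; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve the document until interrupted (hypothetical helper)."""
    HTTPServer(("", port), VoiceXMLHandler).serve_forever()
```

Calling serve() and pointing the gateway at the machine's public URL should be enough for a first test; a real application would generate the document from a database query rather than return a constant string.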

Exercise 3

Sign up for a developer account at one of the VoiceXML gateways (see the list at the end of this chapter). All of the gateways have free developer accounts and many useful services for developers. We prefer BeVocal for its extensive documentation and the plethora of tools it provides, including: a syntax checker; a Web-based emulator so that you can do some of your testing on your PC without using a telephone; an on-line debugger; a log of calls, including error messages, variable values, and even recordings of the actual user utterances; a library of grammars and code that you can use; and more. However, all of the gateways have their own strengths and weaknesses, so use the one you like the best; there is no wrong choice.

The gateway will assign you a telephone number or extension that you can point to your Web server. Point it to a file called hello-world.vxml that contains the VoiceXML example above. This example should work with most gateways, but each gateway employs slightly different VoiceXML syntax, so glance over the online documentation provided for the gateway you choose.

More VoiceXML

Here's an example that accepts user input and behaves differently depending on what the user says:

<?xml version="1.0"?>
<vxml version="2.0">
  <form id="animal_questionnaire">
    <field name="favorite_animal">
      <prompt>
        <audio>Which do you like better, dogs or cats?</audio>
      </prompt>
      <grammar>
        <![CDATA[
          [
            [dog dogs] {<option "dogs">}
            [cat cats] {<option "cats">}
          ]
        ]]>
      </grammar>
      <!-- if the user gave a valid response, the filled block
           is executed. -->
      <filled>
        <if cond="favorite_animal == 'dogs'">
          <!-- this would take the user to a form called
               popular_dog_facts within the same VoiceXML
               document -->
          <goto next="#popular_dog_facts"/>
        <else/>
          <!-- this expression is an ECMAScript (JavaScript)
               expression, composed of a concatenated string
               and variable; this will take the user to the
               URI psychological_evaluation.cgi?affliction=cats -->
          <goto expr="'psychological_evaluation.cgi?affliction='
                + favorite_animal"/>
        </if>
      </filled>
      <!-- if the user responded but it didn't match the
           grammar, the nomatch block is executed -->
      <nomatch>
        I'm sorry, I didn't understand what you said.
      </nomatch>
      <!-- if there is no response for a few seconds, the
           noinput block is executed -->
      <noinput>
        I'm sorry, I didn't hear you.
      </noinput>
    </field>
  </form>
  <!-- additional forms can go here -->
</vxml>

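In this example, we prompt the user with a question, define a grammar that lists the acceptable answers, and branch on the response.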

The structure of the VoiceXML code in this example is basically identical to that of the "Hello, World" example, with a few additional elements. The top two lines are present in every VoiceXML 2.0 document. Next, we have a form; this time the form is named, as we must do if we are to have more than one form in a document.

Note on grammars

In VoiceXML 1.0, the W3C did not specify the grammar format, allowing each VoiceXML platform to implement grammars as it chose. In VoiceXML 2.0, each platform is required to implement the XML format of the W3C's Speech Recognition Grammar Specification (SRGS), the latest draft of which is available from http://www.w3.org/TR/grammar-spec/.

In one vendor's implementation, the following SRGS grammar can be used in place of the grammar in the example:

<grammar xml:lang="en-US"
type="application/srgs+xml" version="1.0">
  <rule id="animal" scope="public">
    <one-of>
      <item><one-of tag="dogs"><item>dog</item><item>dogs</item></one-of></item>
      <item><one-of tag="cats"><item>cat</item><item>cats</item></one-of></item>
    </one-of>
  </rule>
</grammar>

However, other vendors have implemented SRGS slightly differently. As the SRGS specification graduates from a "candidate recommendation", vendors' implementations of SRGS should converge.

We created a variable called favorite_animal using the <field> tag. After we've prompted the user for a response, we have to specify what the user is allowed to answer by defining a grammar. You'll find that various gateways tend to use different grammar formats. The grammar in the questionnaire example is in the GSL (Nuance's Grammar Specification Language) format, which is used by Tellme and BeVocal, among others. That grammar specifies that if the user says "dog" or "dogs", the value of favorite_animal becomes "dogs". If they respond "cat" or "cats", favorite_animal will be set to "cats".

That's all there is to getting user input. Now we can use the value of their response in our program. In this example, if their answer is "dogs", they will be sent to a form named "popular_dog_facts" within the same VoiceXML document. If they answer "cats", they will be sent to a different URL, psychological_evaluation.cgi?affliction=cats. Note how we used a JavaScript expression in the goto tag in order to use the value of the favorite_animal variable.
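On the server side, the URL psychological_evaluation.cgi?affliction=cats is an ordinary HTTP request whose response happens to be VoiceXML rather than HTML. Here is a rough sketch of what such a script might do, written in Python. The function name, the spoken messages, and the whitelist are hypothetical; the text does not say what the real script returns.

```python
from urllib.parse import parse_qs
from xml.sax.saxutils import escape

# Only the values that the grammar can produce are accepted.
ALLOWED_ANIMALS = {"dogs", "cats"}


def render_evaluation(query_string):
    """Return a VoiceXML page for a query string such as 'affliction=cats'."""
    params = parse_qs(query_string)
    affliction = params.get("affliction", [""])[0]
    if affliction in ALLOWED_ANIMALS:
        message = "You said that you prefer %s." % affliction
    else:
        message = "I'm sorry, I did not receive a valid answer."
    # escape() guards against markup characters reaching the document.
    return (
        '<?xml version="1.0"?>\n'
        '<vxml version="2.0">\n'
        "  <form>\n"
        "    <block>\n"
        "      <audio>%s</audio>\n"
        "    </block>\n"
        "  </form>\n"
        "</vxml>\n"
    ) % escape(message)
```

Checking the parameter against a whitelist is a deliberate choice: the gateway normally sends only values from the grammar, but the URL is reachable by any HTTP client, so the script should not trust it.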

Those two examples are enough to give you the gist of VoiceXML and hopefully an appreciation for the simplicity of voice application development using VoiceXML.

Excellent tutorial and reference material can be found on the developer sites at Tellme (http://studio.tellme.com/) and BeVocal (http://cafe.bevocal.com/).

Exercise 4: Grammar Accuracy

Create a simple page that asks the user to name a city in Canada. Start out with a small grammar, e.g.:

[vancouver toronto halifax] {<option "valid_city">}

Your application should respond to the user with something like "Yes, that is a Canadian city" or "I've never heard of that city."

Try out your application. Name some cities that are not on your list and see if it mistakenly thinks they are valid cities. Now add some more cities to your list (e.g., Calgary, Winnipeg, Victoria, Saskatoon). As you make your list longer and longer, you'll tend to start getting a few false positives.

Decide on a rule of thumb for how many elements it's reasonable to have in one grammar.

There are applications that have thousands of elements in a grammar. However, they've typically gone through a process of grammar tuning using representative probabilities for grammar matches. For this exercise, just extend the standard grammar above.
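As the city list grows, typing the grammar by hand becomes tedious. One option is to generate it from a list, as in the sketch below. The helper names are hypothetical, and we are assuming the usual GSL conventions that words are written in lowercase and that a multi-word phrase is grouped in parentheses; names containing punctuation would need extra handling.

```python
def gsl_phrase(phrase):
    """Lowercase a phrase; wrap multi-word phrases in parentheses."""
    words = phrase.lower().split()
    if len(words) == 1:
        return words[0]
    return "(" + " ".join(words) + ")"


def gsl_grammar(phrases, value):
    """Build one GSL alternation that maps every phrase to the same value."""
    alternatives = " ".join(gsl_phrase(p) for p in phrases)
    return '[%s] {<option "%s">}' % (alternatives, value)


CITIES = ["Vancouver", "Toronto", "Halifax", "Calgary", "Winnipeg",
          "Victoria", "Saskatoon"]
print(gsl_grammar(CITIES, "valid_city"))
# prints: [vancouver toronto halifax calgary winnipeg victoria saskatoon] {<option "valid_city">}
```

Generating the grammar from a list also makes it easy to rerun the experiment with 10, 50, or 200 cities and note where false positives begin to appear.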

Exercise 5: What's New and Who's New

Add voice-accessible "what's new" and "who's new" features to your community. A user should be able to call up and hear the most recent five contributions by other community members and the names of the last five people who registered.

Consider that, if you're authenticating users over the phone, the most interesting contributions may be new responses to questions asked by that user.
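A "what's new" page is just a database query whose results are wrapped in VoiceXML instead of HTML. The sketch below assumes that the query has already produced a list of (author, title) pairs, newest first; that shape, the function name, and the spoken wording are illustrative only.

```python
import xml.etree.ElementTree as ET


def whats_new_vxml(contributions, limit=5):
    """Build a VoiceXML document that reads out the newest contributions.

    contributions: (author, title) pairs, newest first (hypothetical shape).
    """
    vxml = ET.Element("vxml", version="2.0")
    form = ET.SubElement(vxml, "form", id="whats_new")
    block = ET.SubElement(form, "block")
    recent = contributions[:limit]
    if not recent:
        ET.SubElement(block, "audio").text = "There are no new contributions."
    for author, title in recent:
        # ElementTree escapes &, < and > in the text when serializing.
        ET.SubElement(block, "audio").text = "%s posted: %s." % (author, title)
    return '<?xml version="1.0"?>\n' + ET.tostring(vxml, encoding="unicode")
```

Building the document with an XML library rather than string concatenation means that a title such as "Cats & Dogs" cannot break the page, which matters because a voice browser will typically refuse a malformed document outright.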

Exercise 6: Content Approval/Rejection by Telephone

Many Web sites have user-created content that must be approved by an administrator or moderator before it becomes live on the site. Examples are the product reviews at amazon.com, article submissions at slashdot.org, and bulletin board postings in a moderated forum.

Typically you'd open your Web browser, log in, and go to an admin page from which you can approve, reject, or edit submissions.

But it sure would be nice to approve and reject submissions with your cellphone when you're out walking the dog. (Editing is harder to do by phone, but it's less common anyway, so it can wait until you're back at your desk.)

Create some simple voice-accessible admin pages. Since the typical username/password authentication is so tedious, you might want to make them accessible with just a numeric PIN. Note that it isn't ideal in general to protect a set of pages with just one PIN because that makes it harder to delegate or revoke admin privileges later, but it will do for this exercise.
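One possible shape for the PIN login is sketched below. It relies on the built-in digits grammar that VoiceXML 2.0 defines for fields, so the caller can either speak the digits or press keys. The function names, the prompt wording, and the submit URL are hypothetical, and some gateways may handle built-in grammars slightly differently.

```python
import hmac
import xml.etree.ElementTree as ET


def pin_form_vxml(submit_url, length=4):
    """Build a form that collects a numeric PIN and posts it to the server."""
    vxml = ET.Element("vxml", version="2.0")
    form = ET.SubElement(vxml, "form", id="login")
    # Built-in digits grammar: accepts spoken digits or keypad tones.
    field = ET.SubElement(form, "field", name="pin",
                          type="digits?length=%d" % length)
    ET.SubElement(field, "prompt").text = "Please enter your PIN."
    filled = ET.SubElement(field, "filled")
    # POST keeps the PIN out of the URL and therefore out of access logs.
    ET.SubElement(filled, "submit", next=submit_url, namelist="pin",
                  method="post")
    return '<?xml version="1.0"?>\n' + ET.tostring(vxml, encoding="unicode")


def pin_is_valid(submitted, expected):
    """Compare PINs in constant time so timing does not leak partial matches."""
    return hmac.compare_digest(submitted.encode(), expected.encode())
```

The page that receives the post would call pin_is_valid and then return either the list of pending submissions or a polite refusal, both as VoiceXML.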

Exercise 7: Implement Some Real Services

Depending on the complexity of the services you came up with in Exercise 2, implement one, two, or three of them. If you implement more than one, you may wish to create a voice service menu as the entry point for all your voice users.
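VoiceXML 2.0 provides a menu element for exactly this kind of entry point. The following sketch generates one from a list of services; the service names and URLs are made up for illustration, and you should confirm that your gateway supports menu and enumerate as used here.

```python
import xml.etree.ElementTree as ET


def service_menu_vxml(services):
    """Build a VoiceXML menu from (spoken phrase, URL) pairs."""
    vxml = ET.Element("vxml", version="2.0")
    menu = ET.SubElement(vxml, "menu", id="main_menu")
    prompt = ET.SubElement(menu, "prompt")
    prompt.text = "Welcome. Please say one of: "
    # <enumerate/> reads the choices aloud, so the prompt stays in sync
    # with the list of services.
    ET.SubElement(prompt, "enumerate")
    for phrase, url in services:
        ET.SubElement(menu, "choice", next=url).text = phrase
    return '<?xml version="1.0"?>\n' + ET.tostring(vxml, encoding="unicode")


# Hypothetical services and URLs, for illustration only.
print(service_menu_vxml([
    ("what's new", "/voice/whats-new"),
    ("who's new", "/voice/whos-new"),
]))
```

Keeping the list of services in one place means that adding a new voice service is a one-line change, and callers hear the new option without anyone editing the prompt.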

Exercise 8: Client Signoff

As with mobile browser interfaces, a voice interface is tough for most people to think about until they've actually used one. Try to sit down with your client face-to-face and observe them going through all the nooks and crannies of your VoiceXML interface. If that isn't practical, email your client explicit instructions and then follow up with a phone call.

Write down the client's answers to the following questions:

Mobile versus Voice Applications

Mobile text browsers and VoiceXML each have strengths and weaknesses and are therefore appropriate for different applications — or for different parts of the same application.

Mobile Browser | VoiceXML
requires browser-enhanced telephones | can be used with any phone
user-input with uncomfortable keypads | speech or keypad input
works well in noisy environments | hard to use in noisy environments
you need to develop versions of your software for a variety of mobile gateways | you only need to develop one version of your software
works well for displaying long lists of information | works poorly for giving the user long lists of information
user can enter arbitrary information | user can only say predefined phrases

Figure 10.2:

One way to take advantage of the best of mobile and voice interfaces will be to develop multi-modal applications like the GPRS airline reservation system in the last chapter. A number of groups are actively developing specifications for multi-modal applications, including the Speech Application Language Tags (SALT) Forum (http://www.saltforum.org/).

Beyond VoiceXML: Conversational Speech

Will all voice applications be VoiceXML applications? The current syntax of VoiceXML is geared toward producing a user experience of navigating through hierarchical menus. State-of-the-art research is moving beyond this towards conversational systems in which any utterance makes sense at any time and where context is carried from exchange to exchange. For example, you can call the MIT Laboratory for Computer Science's server at 1-888-573-8255:

Notice how the system, more fully described at http://groups.csail.mit.edu/sls/applications/jupiter.shtml, assumed that you were still interested in rain when asking about Detroit, context carried over from the Boston question.

In the long run, as these more natural conversational technologies are perfected, the syntax of VoiceXML will have to grow to accommodate the full power of speech interpreters or be eclipsed by another standard.


VoiceXML gateways: Related links:

Time and Motion

Each member of the team should work through the basics, Exercises 1-4, individually and expect to spend two to three hours.

The team should plan to spend one to two hours together designing the voice interface, but may divide the work of prototyping and refining the voice interface plus Exercises 5 and 6. A reasonable scope is eight to twelve programmer-hours.

The time required for client signoff will vary depending on the client's level of interest. Plan to spend at least thirty minutes on the signoff.


eve@eveandersson.com, philg@mit.edu, aegrumet@mit.edu